
Pandas and Polars
Pandas is probably the most popular library for data manipulation and analysis. If you work in data, I’m sure you’ve heard of it, or you may even use it daily. Pandas is useful and versatile across all sorts of data tasks; however, the issue most often raised about it is its performance.
To make up for that weakness, other libraries such as Dask, Ray, and Modin have appeared, and they can speed up your Pandas code quite a bit. Polars is one of the libraries people say is faster than Pandas. It’s written in Rust and still fairly new, so there’s less documentation and fewer resources available for it than for Pandas.
A performance test comparing Polars against other popular libraries such as Pandas has already been done. Here’s the link to the test. But I wanted to run the test myself on my local machine so that the comparison would be more relatable to a general audience.
My Setup
My Laptop’s Specs
I’m using a Mac with the following specs. It has 500GB of flash storage.

Libraries
I’m using:
- pandas==1.5.3
- polars==0.16.2
Data
I’m using a csv file containing 10 million rows and 14 columns, which comes to about 1.2GB. I used this csv data generator with the exact command given in the example in that GitHub repo, modifying only the number of rows. You’ll need Node.js installed on your computer to use the generator.
I’m also using another csv file with the same columns but only 1 million rows; it’s used for comparing the performance of the join operations.
In case you’d like to generate the exact same data to test on your environment, use the following commands:
A csv file with 10 million rows:
node generator.js "email,first,last,age,street,city,state,zip,digid,date(3),latitude,longitude,pick(RED|BLUE|YELLOW|GREEN|WHITE),string" 10000000 10million.csv
A csv file with 1 million rows:
node generator.js "email,first,last,age,street,city,state,zip,digid,date(3),latitude,longitude,pick(RED|BLUE|YELLOW|GREEN|WHITE),string" 1000000 1million.csv
After generating the csv files, I manually added a header row with the following column names:
email,first,last,age,street,city,state,zip,digid,date,latitude,longitude,pick,string
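If you’d rather script that step than edit a 1.2GB file by hand, here’s a small sketch of how it could be done in Python (prepend_header is my own hypothetical helper, not part of the generator):

import shutil

HEADER = "email,first,last,age,street,city,state,zip,digid,date,latitude,longitude,pick,string\n"

def prepend_header(src_path, dst_path):
    # Write the header first, then stream the original file after it,
    # so the whole 1.2GB file never has to fit in memory.
    with open(dst_path, 'w') as dst:
        dst.write(HEADER)
        with open(src_path, 'r') as src:
            shutil.copyfileobj(src, dst)

prepend_header('10million.csv', '10million_with_header.csv')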
Read a CSV File
Performance

Polars read the csv file about 10 times faster than Pandas.
Code
Polars:
import polars as pl

def read_polars(file_path):
    '''Read a csv file with polars'''
    # scan_csv builds a lazy reader; collect() materializes it into a DataFrame
    df = pl.scan_csv(file_path).collect()
    return df
Pandas:
import pandas as pd

def read_pandas(file_path):
    '''Read a csv file with pandas'''
    df = pd.read_csv(file_path)
    return df
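For reference, a minimal timing harness along these lines can be used to compare the functions (this sketch isn’t the exact code from my repo, and the three-run repetition is just one reasonable choice):

import time

def time_it(func, *args, n_runs=3):
    # Run the function a few times and report the best wall-clock time,
    # to smooth out one-off noise from the OS or other processes.
    best = float('inf')
    for _ in range(n_runs):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

print(f"polars: {time_it(read_polars, '10million.csv'):.2f}s")
print(f"pandas: {time_it(read_pandas, '10million.csv'):.2f}s")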
Simple Aggregations with Groupby
Now we’ll test simple aggregations (count, mean, min, and max) combined with groupby.
Performance

Polars is about 3 times faster.
Code
Polars:
def agg_polars(pl_df):
    '''take a polars df and execute a few aggregate functions'''
    q = (
        pl_df.lazy()
        .groupby("state")
        .agg(
            [
                pl.count().alias('cnt'),
                pl.col("age").mean().alias('mean age'),
                pl.col("age").min().alias('min age'),
                pl.col("age").max().alias('max age'),
            ]
        )
        .sort("state")
    )
    output = q.collect()
    return output
Pandas:
def agg_pandas(pd_df):
    '''take a pandas df and execute a few aggregate functions'''
    output = (
        pd_df
        .groupby('state')['age']
        .agg(['count', 'mean', 'min', 'max'])
        .sort_values(by=['state'])
    )
    return output
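For completeness, here’s a usage sketch showing how the two functions can be invoked (not necessarily the exact code in the repo):

# Load each frame once, then run the aggregations on the in-memory data,
# so the timing covers only the groupby/agg work and not the csv parsing.
pl_df = read_polars('10million.csv')
pd_df = read_pandas('10million.csv')

print(agg_polars(pl_df).head())
print(agg_pandas(pd_df).head())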
Window Functions
I added two columns: one is the average age per state, and the other is the age rank within each state.
Performance

Polars is about 8 times faster.
Code
Polars:
def window_func_polars(pl_df):
    '''
    new column 1 - avg age per state
    new column 2 - age rank within each state
    '''
    q = (
        pl_df.lazy()
        .select(['email', 'first', 'last', 'state', 'age'])
        .with_columns([
            pl.col('age').mean().over('state').alias('avg_age_per_state'),
            pl.col('age').rank(method='dense').over('state').alias('age_rank'),
        ])
        .sort('state')
    )
    output = q.collect()
    return output
Pandas:
def window_func_pandas(pd_df):
    '''
    new column 1 - avg age per state
    new column 2 - age rank within each state
    '''
    output = (
        pd_df
        .loc[:, ['email', 'first', 'last', 'state', 'age']]
        .assign(avg_age_per_state=lambda df: df.groupby('state')['age'].transform('mean'))
        .assign(age_rank=lambda df: df.groupby('state')['age'].rank(method='dense'))
        .sort_values(by=['state'])
    )
    return output
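To make the window semantics concrete, here’s a tiny toy example (illustrative data only, not from the benchmark) showing what the two new columns contain:

toy = pl.DataFrame({
    'state': ['CA', 'CA', 'NY', 'NY', 'NY'],
    'age':   [20,   40,   30,   30,   50],
})

out = toy.with_columns([
    pl.col('age').mean().over('state').alias('avg_age_per_state'),
    pl.col('age').rank(method='dense').over('state').alias('age_rank'),
])
print(out)
# avg_age_per_state: 30.0 for the CA rows, ~36.67 for the NY rows
# age_rank: 1, 2 within CA; 1, 1, 2 within NY (dense rank gives ties the same rank)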
Join Operations
I chose the most commonly used join operations: inner join, left join, and full outer join.
Performance

Inner join: Polars is about 12 times faster
Left join: Polars is about 6 times faster
Full outer join: Polars is about 7 times faster
Code
Polars:
def join_polars(pl_df1, pl_df2, on_column, join_type):
    '''a func to join in polars based on inputs'''
    output = (
        pl_df1.lazy()
        .join(pl_df2.lazy(), on=on_column, how=join_type)
        .collect()
    )
    return output
Pandas:
def join_pandas(pd_df1, pd_df2, on_column, join_type):
    '''a func to join in pandas based on inputs'''
    output = pd_df1.merge(pd_df2, on=on_column, how=join_type)
    return output
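And here’s a usage sketch driving those helpers for the three join types, reusing the time_it helper from the read-csv sketch above (the 'email' join key is just an illustrative choice; note that both libraries spell a full outer join as 'outer'):

pl_small = read_polars('1million.csv')
pd_small = read_pandas('1million.csv')

# Both libraries accept the same strings for these three join types
for join_type in ['inner', 'left', 'outer']:
    t_polars = time_it(join_polars, pl_df, pl_small, 'email', join_type)
    t_pandas = time_it(join_pandas, pd_df, pd_small, 'email', join_type)
    print(f"{join_type}: polars {t_polars:.2f}s, pandas {t_pandas:.2f}s")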
Visualizing the Result
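A quick matplotlib sketch like the following can reproduce the gist of the chart from the approximate speedup factors quoted in the sections above:

import matplotlib.pyplot as plt

# Approximate speedup factors (Polars over Pandas) from the sections above
operations = ['read csv', 'groupby agg', 'window funcs', 'inner join', 'left join', 'full outer join']
speedups = [10, 3, 8, 12, 6, 7]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(operations, speedups)
ax.set_xlabel('times faster than Pandas')
ax.set_title('Polars vs. Pandas on a 10M-row csv')
plt.tight_layout()
plt.show()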

Limitations
First of all, this is by no means an official performance test of Polars and Pandas!
I didn’t put much deliberate thought into the complexity of the data or the data format (csv). I chose 10 million rows because that’s roughly the point where I’d say the data is big enough not to be called small, and I chose csv because it was easy to generate. If the dataset were much smaller, I can see Pandas being comparable to Polars.
I also didn’t control for things like how many browser tabs were open, how much other software was running, my internet speed, etc., which may or may not have affected the results.
It’s also worth mentioning that I may not have used the most optimal approach in either Polars or Pandas, given my limited expertise in each library.
Conclusion
Given the limitations listed above, I still consider Polars a strong alternative to the Pandas library (which may be obvious to many of you even without my testing!). I’ll probably start using Polars more where possible for performance reasons, though I can definitely see cases where Pandas is more appropriate, depending on the project.
Source code: GitHub repo