Stuff by Yuki

Pandas vs Polars – Speed Comparison

Posted on February 20, 2023 (updated March 7, 2023)

Pandas and Polars

Pandas is probably the most popular Python library for data manipulation and analysis. If you work in data, you've almost certainly heard of it, or you may use it on a daily basis. Pandas is useful and versatile across a wide range of data tasks; however, its most commonly discussed weakness is performance.

To address that weakness, other libraries such as Dask, Ray, and Modin have appeared; they can speed up Pandas-style code quite a bit. Polars is another library that people say is faster than Pandas. It is written in Rust and still fairly new, so there is less documentation and there are fewer resources available for Polars than for Pandas.

A performance test comparing Polars against other popular libraries, including Pandas, has already been done. Here's the link to the test. But I wanted to test it myself on my local machine so that the comparison is more relatable to a general audience.

My Setup

My Laptop’s Specs

I’m using a Mac with the following specs. It’s got 500GB flash storage.

Libraries

I’m using:

  • pandas==1.5.3
  • polars==0.16.2

Data

I’m using a CSV file containing 10 million rows and 14 columns, about 1.2GB on disk. I used this CSV data generator with the exact same command given in the example in that GitHub repo, modifying only the number of rows. You’d need Node.js installed on your computer to use the generator.

I’m also using another CSV file with the same columns but only 1 million rows; I use it when comparing the performance of the join operations.

In case you’d like to generate the exact same data to test on your environment, use the following commands:

A csv file with 10 million rows:


node generator.js "email,first,last,age,street,city,state,zip,digid,date(3),latitude,longitude,pick(RED|BLUE|YELLOW|GREEN|WHITE),string" 10000000 10million.csv

A csv file with 1 million rows:


node generator.js "email,first,last,age,street,city,state,zip,digid,date(3),latitude,longitude,pick(RED|BLUE|YELLOW|GREEN|WHITE),string" 1000000 1million.csv

After generating the CSV files, I manually added a header row with the following column names:

email,first,last,age,street,city,state,zip,digid,date,latitude,longitude,pick,string
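I added the header by hand, but the same step can be scripted. A minimal sketch (`prepend_header` is my own helper name, not part of the generator; it reads the whole file into memory, which is workable as a one-off for a file this size, but you'd want to stream for anything much larger):

```python
# Prepend a header row to a headerless CSV produced by the generator.
HEADER = "email,first,last,age,street,city,state,zip,digid,date,latitude,longitude,pick,string\n"

def prepend_header(path, header=HEADER):
    with open(path) as f:
        body = f.read()           # note: loads the entire file into memory
    with open(path, "w") as f:
        f.write(header + body)    # rewrite with the header on top
```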

Read a CSV File

Performance

Polars read a csv file about 10 times faster than Pandas.

Code

Polars:


def read_polars(file_path):
    '''
    Read a csv with polars
    '''
    df = pl.scan_csv(file_path).collect()
    return df

Pandas:


def read_pandas(file_path):
    '''
    Read a csv file with pandas
    '''
    df = pd.read_csv(file_path)
    return df
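The timing code itself isn't shown in the post; a minimal sketch of the kind of harness you could wrap around each function, based on `time.perf_counter` (the best-of-three approach here is my assumption, not necessarily what was actually used):

```python
import time

def time_it(func, *args, repeats=3):
    """Run func(*args) several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# e.g. compare time_it(read_polars, "10million.csv") vs time_it(read_pandas, "10million.csv")
```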

Simple Aggregations with Groupby

Now we’ll test simple aggregations such as sum and average with groupby.

Performance

Polars is about 3 times faster.

Code

Polars:


def agg_polars(pl_df):
    '''
    take a polars df and execute a few aggregate functions
    '''
    q = (
        pl_df.lazy()
        .groupby("state")
        .agg(
            [
                pl.count().alias('cnt'),
                pl.col("age").mean().alias('mean age'),
                pl.col("age").min().alias('min age'),
                pl.col("age").max().alias('max age')
            ]
        )
        .sort("state")
    )
    
    output = q.collect()

    return output

Pandas:


def agg_pandas(pd_df):
    '''
    take a pandas df and execute a few aggregate functions
    '''
    output = (
        pd_df
        .groupby('state')['age']
        .agg(            
            ['count', 'mean', 'min', 'max'],    
        )
        .sort_values(by=['state'])
    )

    return output

Window Functions

I added two columns: the average age per state, and the age rank within each state.

Performance

Polars is about 8 times faster.

Code

Polars:


def window_func_polars(pl_df):
    '''
    new column 1 - avg age per state
    new column 2 - age rank within each state
    '''
    q = (
        pl_df.lazy()
        .select(
            [
                'email',
                'first',
                'last',
                'state',
                'age'
            ]
        )
        .with_columns([
            pl.col('age').mean().over('state').alias('avg_age_per_state'),
            pl.col('age').rank(method='dense').over('state').alias('age_rank')
        ])
        .sort('state')
    )

    output = q.collect()

    return output

Pandas:


def window_func_pandas(pd_df):
    '''
    new column 1 - avg age per state
    new column 2 - age rank within each state
    '''
    output = (
        pd_df
        .loc[:, ['email', 'first', 'last', 'state', 'age']]
        .assign(avg_age_per_state=lambda df: df.groupby('state')['age'].transform('mean'))
        .assign(age_rank=lambda df: df.groupby('state')['age'].rank(method='dense'))
        .sort_values(by=['state'])
    )
    
    return output
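Both versions use `method='dense'`: dense ranking numbers the distinct values 1, 2, 3, … with no gaps after ties. A small pure-Python illustration of the semantics:

```python
def dense_rank(values):
    """Dense rank: ties share a rank, and the next distinct value gets rank + 1."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

print(dense_rank([30, 25, 30, 40]))  # [2, 1, 2, 3] -- the two 30s tie at rank 2
```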

Join Operations

I chose the most commonly used join operations: inner join, left join, and full outer join.

Performance

Inner join: Polars is about 12 times faster

Left join: Polars is about 6 times faster

Full outer join: Polars is about 7 times faster

Code

Polars:


def join_polars(pl_df1, pl_df2, on_column, join_type):
    '''
    a func to join in polars based on inputs
    '''
    output = pl_df1.lazy().join(pl_df2.lazy(), on=on_column, how=join_type).collect()
    return output

Pandas:


def join_pandas(pd_df1, pd_df2, on_column, join_type):
    '''
    a func to join in pandas based on inputs
    '''
    output = pd_df1.merge(pd_df2, on=on_column, how=join_type)
    return output
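As a reminder of what's being compared: the three join types differ only in which unmatched keys they keep. A pure-Python sketch over key sets (illustration of the semantics only, not how either library implements joins):

```python
def join_keys(left, right, how):
    """Return the set of join keys kept by each join type."""
    left, right = set(left), set(right)
    if how == "inner":
        return left & right   # only keys present on both sides
    if how == "left":
        return left           # every left key, matched or not
    if how == "outer":
        return left | right   # every key from either side
    raise ValueError(how)
```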

Visualizing the Result
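To summarize the results in one place, here's a quick text bar chart built from the approximate speedup ratios quoted in each section above (the numbers are the article's own figures; the chart format is my own):

```python
# Approximate speedups (Polars vs Pandas) reported in this post.
speedups = {
    "read csv": 10,
    "groupby agg": 3,
    "window functions": 8,
    "inner join": 12,
    "left join": 6,
    "full outer join": 7,
}

for op, x in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{op:18s} {'#' * x} {x}x")
```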

Limitations

First of all, this is not at all an official performance test of Polars and Pandas!

I didn’t deliberately tune the complexity of the data or the data format (CSV). I chose 10 million rows because that’s about the size where I’d say the data is big enough that it’s not small, and CSV because it was easy to generate. If the dataset were much smaller, I can see Pandas being comparable to Polars.

I didn’t control for things like how many browser tabs were open, how many applications were running, my internet speed, etc., which may or may not have affected the results.

Another thing worth mentioning is that I may not have used the most optimal approach in either Polars or Pandas, due to my lack of expertise in each library.

Conclusion

Given the limitations I listed, I still consider Polars a strong alternative to Pandas (which may be obvious to many of you even without my testing!). I’d probably start using Polars where possible for performance reasons, though I can definitely see cases where Pandas is more appropriate, depending on the project.

Source code: GitHub repo

2 thoughts on “Pandas vs Polars – Speed Comparison”

  1. Nisheeth Jaiswal says:
    March 5, 2023 at 7:16 pm

    Hello, this is nice. Could you please compare the same on the new Pandas 2.0 with the PyArrow backend instead of NumPy?

    1. Yuki says:
      March 6, 2023 at 11:28 am

      That’s on my list!