Menu
Stuff by Yuki
  • Home
  • Data Engineering
    • Python
  • Business Intelligence
    • Power BI
    • Tableau
  • Perspectives
  • About
  • Contact
Stuff by Yuki

Aggregations Over Multiple Columns in Polars

Posted on June 30, 2023July 22, 2023
Image by Pablo Hermoso on Unsplash

Aggregations such as sum and mean are frequently used in data science and analysis. There are cases where you might want to aggregate values over columns instead of rows. Meaning that if you have two columns A and B, you want to aggregate values on horizontally (columns), not vertically (rows).

Polars allows this type of aggregation just like other dataframe libraries.

I’ll be demonstrating how to do this by using a pokemon dataset. This dataset is also used in Polars documentation.

You can find the full code in my GitHub repo.

Aggregate Over Multiple Columns in Polars

In this example, I’m calculating the sum of HP, Attack, Defense, Sp. Atk, Sp. Def, Speed. It’s essentially getting the value of “Total” column.

The dataset looks like this:

Copy Copied Use a different Browser

import polars as pl

# get data
FILE_PATH = "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
df = pl.read_csv(FILE_PATH)
print(df.head())
"""
shape: (5, 13)
┌─────┬───────────────────────┬────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---    ┆ ---    ┆   ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ str    ┆ str    ┆   ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass  ┆ Poison ┆ … ┆ 65      ┆ 45    ┆ 1          ┆ false     │
│ 2   ┆ Ivysaur               ┆ Grass  ┆ Poison ┆ … ┆ 80      ┆ 60    ┆ 1          ┆ false     │
│ 3   ┆ Venusaur              ┆ Grass  ┆ Poison ┆ … ┆ 100     ┆ 80    ┆ 1          ┆ false     │
│ 3   ┆ VenusaurMega Venusaur ┆ Grass  ┆ Poison ┆ … ┆ 120     ┆ 80    ┆ 1          ┆ false     │
│ 4   ┆ Charmander            ┆ Fire   ┆ null   ┆ … ┆ 50      ┆ 65    ┆ 1          ┆ false     │
└─────┴───────────────────────┴────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘
"""

For the calculation, I’d need to sum values across columns. There are 3 ways to accomplish this.

  • Using pl.DataFrame.sum() with its parameter axis set to 1.
  • Using pl.sum_horizontal (Polars version 0.18.8 or newer).
  • Using pl.DataFrame.fold().

Using df.sum(axis=1)

This way of aggregating values over columns exists in a library like pandas as well.

pl.all() is for selecting all the columns and exclude() does the job of excluding specific columns. Another option is to specify the columns that you want to aggregate values over.

Copy Copied Use a different Browser

# aggregation over multiple columns
agg_col = (
    df
    .select(
        pl.all().exclude('#', 'Name', 'Type 1', 'Type 2', 'Generation', 'Legendary', 'Total')
        # pl.col(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'])  # another option

    )
    .sum(axis=1)
    .alias('agg over cols')
)
print(agg_col)
"""
shape: (163,)
Series: 'agg over cols' [i64]
[
        318
        405
        525
        625
        309
        405
        534
        634
        634
        314
        405
        530
        …
]
"""

Just calculating aggregations over multiple columns is not too difficult. But adding that as a new column makes things a bit tricky in Polars or you need to understand how expressions work in Polars. In pandas, adding this type of aggregation is as easy as adding a new column like this:

df['new column'] = df.sum(axis=1)

But you won’t do this way of column assignment in Polars. You would use with_columns() or select().

Even though the new column is calculating the values at the dataframe level, you can pass that result/output (Polars series) into a method. This allows us to still add other columns as needed as well (I added a window function in the code for demonstration).

with_columns()

Copy Copied Use a different Browser

# add the calculation above as a new column
# option 1 - with_columns
df_1 = (
    df
    .with_columns(
        agg_col,
        pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
    )
)
print(df_1.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))

"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name                      ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ ---                       ┆ ---    ┆ ---   ┆ ---           ┆ ---               │
│ str                       ┆ str    ┆ i64   ┆ i64           ┆ f64               │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur                 ┆ Grass  ┆ 318   ┆ 318           ┆ 66.153846         │
│ Ivysaur                   ┆ Grass  ┆ 405   ┆ 405           ┆ 66.153846         │
│ Venusaur                  ┆ Grass  ┆ 525   ┆ 525           ┆ 66.153846         │
│ VenusaurMega Venusaur     ┆ Grass  ┆ 625   ┆ 625           ┆ 66.153846         │
│ …                         ┆ …      ┆ …     ┆ …             ┆ …                 │
│ Charizard                 ┆ Fire   ┆ 534   ┆ 534           ┆ 65.857143         │
│ CharizardMega Charizard X ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ CharizardMega Charizard Y ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ Squirtle                  ┆ Water  ┆ 314   ┆ 314           ┆ 66.967742         │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""

select()

Copy Copied Use a different Browser

# option 2 - select
df_2 = (
    df
    .select(
        pl.col('*'),
        agg_col,
        pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
    )
)
print(df_2.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name                      ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ ---                       ┆ ---    ┆ ---   ┆ ---           ┆ ---               │
│ str                       ┆ str    ┆ i64   ┆ i64           ┆ f64               │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur                 ┆ Grass  ┆ 318   ┆ 318           ┆ 66.153846         │
│ Ivysaur                   ┆ Grass  ┆ 405   ┆ 405           ┆ 66.153846         │
│ Venusaur                  ┆ Grass  ┆ 525   ┆ 525           ┆ 66.153846         │
│ VenusaurMega Venusaur     ┆ Grass  ┆ 625   ┆ 625           ┆ 66.153846         │
│ …                         ┆ …      ┆ …     ┆ …             ┆ …                 │
│ Charizard                 ┆ Fire   ┆ 534   ┆ 534           ┆ 65.857143         │
│ CharizardMega Charizard X ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ CharizardMega Charizard Y ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ Squirtle                  ┆ Water  ┆ 314   ┆ 314           ┆ 66.967742         │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""

Using pl.sum_horizontal()

This is a clearer way of aggregating over columns than using df.sum(axis=1). This should be self-explanatory. You just specify columns you want to aggregate over. There is other variations like min, max, and even cumsom for horizontal aggregations.

Copy Copied Use a different Browser

df = (
    df
    .with_columns(
        pl.sum_horizontal(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']).alias('agg over cols'),
        pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
    )
)
print(df.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name                      ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ ---                       ┆ ---    ┆ ---   ┆ ---           ┆ ---               │
│ str                       ┆ str    ┆ i64   ┆ i64           ┆ f64               │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur                 ┆ Grass  ┆ 318   ┆ 318           ┆ 66.153846         │
│ Ivysaur                   ┆ Grass  ┆ 405   ┆ 405           ┆ 66.153846         │
│ Venusaur                  ┆ Grass  ┆ 525   ┆ 525           ┆ 66.153846         │
│ VenusaurMega Venusaur     ┆ Grass  ┆ 625   ┆ 625           ┆ 66.153846         │
│ …                         ┆ …      ┆ …     ┆ …             ┆ …                 │
│ Charizard                 ┆ Fire   ┆ 534   ┆ 534           ┆ 65.857143         │
│ CharizardMega Charizard X ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ CharizardMega Charizard Y ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ Squirtle                  ┆ Water  ┆ 314   ┆ 314           ┆ 66.967742         │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""

Using pl.fold()

pl.fold() can be used for more complicated horizontal aggregations. But in this one, we’re just calculating the sum.

There are 3 parameters, acc, function, and exprs. “acc” stands for accumulator expression, which will be initialized when the calculation starts. For example, if you put pl.lit((1) instead of pl.lit(0) in the code below, then the resulting values will be added by 1.

“function” is specifying what you’re applying to values in each column. In my code, it has “acc” and “x”. In order to understand what they mean, you can think of them as “series1” and “series2”. So you’re defining what operation you want to apply to the subsequent series/columns.

“exprs” is the expressions to specify columns you aggregate over. You can even add a condition like pl.col(*) > 120 to filter out some data.

Copy Copied Use a different Browser

df = (
    df
    .with_columns(
        pl.fold(acc=pl.lit(0), function=lambda acc, x: acc + x, exprs=pl.col(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'])).alias('agg over cols'),
        pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
    )
)
print(df.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name                      ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ ---                       ┆ ---    ┆ ---   ┆ ---           ┆ ---               │
│ str                       ┆ str    ┆ i64   ┆ i64           ┆ f64               │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur                 ┆ Grass  ┆ 318   ┆ 318           ┆ 66.153846         │
│ Ivysaur                   ┆ Grass  ┆ 405   ┆ 405           ┆ 66.153846         │
│ Venusaur                  ┆ Grass  ┆ 525   ┆ 525           ┆ 66.153846         │
│ VenusaurMega Venusaur     ┆ Grass  ┆ 625   ┆ 625           ┆ 66.153846         │
│ …                         ┆ …      ┆ …     ┆ …             ┆ …                 │
│ Charizard                 ┆ Fire   ┆ 534   ┆ 534           ┆ 65.857143         │
│ CharizardMega Charizard X ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ CharizardMega Charizard Y ┆ Fire   ┆ 634   ┆ 634           ┆ 65.857143         │
│ Squirtle                  ┆ Water  ┆ 314   ┆ 314           ┆ 66.967742         │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""

Summary

Hope this post helps you learn how to aggregate over multiple columns in Polars. Polars expressions are powerful. Once you understand its basics, you’ll be able to take advantage of its flexibility.

References

  • https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.sum.html
  • https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.all.html
  • https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.exclude.html
  • https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.with_columns.html
  • https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.select.html
  • https://pola-rs.github.io/polars-book/user-guide/expressions/folds/
  • https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.fold.html
  • https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.sum_horizontal.html

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • Where I’m Headed in the Next 5 Years
  • Open-Source vs Vendor Data Tools
  • Developing the Habit of Writing
  • How to Inspect and Optimize Query Plans in Python Polars
  • Learn Python Polars with Polars Cookbook

Popular Posts

  • A Running Total Calculation with Quick Measure in Power BI
  • A Complete Guide to Git Integration in Power BI
  • How To Copy And Paste Report Page in Power BI
  • Handling Missing Values in Polars
  • How to Convert String to Date or Datetime in Polars

connect with me

  • LinkedIn
  • Twitter
  • Github
  • Website

Search Articles

©2025 Stuff by Yuki | Powered by SuperbThemes