Aggregations such as sum and mean are frequently used in data science and analysis. There are cases where you might want to aggregate values over columns instead of rows. Meaning that if you have two columns A and B, you want to aggregate values on horizontally (columns), not vertically (rows).
Polars allows this type of aggregation just like other dataframe libraries.
I’ll be demonstrating how to do this by using a pokemon dataset. This dataset is also used in Polars documentation.
You can find the full code in my GitHub repo.
Aggregate Over Multiple Columns in Polars
In this example, I’m calculating the sum of HP, Attack, Defense, Sp. Atk, Sp. Def, Speed. It’s essentially getting the value of “Total” column.
The dataset looks like this:
import polars as pl
# get data
FILE_PATH = "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
df = pl.read_csv(FILE_PATH)
print(df.head())
"""
shape: (5, 13)
┌─────┬───────────────────────┬────────┬────────┬───┬─────────┬───────┬────────────┬───────────┐
│ # ┆ Name ┆ Type 1 ┆ Type 2 ┆ … ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ bool │
╞═════╪═══════════════════════╪════════╪════════╪═══╪═════════╪═══════╪════════════╪═══════════╡
│ 1 ┆ Bulbasaur ┆ Grass ┆ Poison ┆ … ┆ 65 ┆ 45 ┆ 1 ┆ false │
│ 2 ┆ Ivysaur ┆ Grass ┆ Poison ┆ … ┆ 80 ┆ 60 ┆ 1 ┆ false │
│ 3 ┆ Venusaur ┆ Grass ┆ Poison ┆ … ┆ 100 ┆ 80 ┆ 1 ┆ false │
│ 3 ┆ VenusaurMega Venusaur ┆ Grass ┆ Poison ┆ … ┆ 120 ┆ 80 ┆ 1 ┆ false │
│ 4 ┆ Charmander ┆ Fire ┆ null ┆ … ┆ 50 ┆ 65 ┆ 1 ┆ false │
└─────┴───────────────────────┴────────┴────────┴───┴─────────┴───────┴────────────┴───────────┘
"""
For the calculation, I’d need to sum values across columns. There are 3 ways to accomplish this.
- Using pl.DataFrame.sum() with its parameter axis set to 1.
- Using pl.sum_horizontal (Polars version 0.18.8 or newer).
- Using pl.DataFrame.fold().
Using df.sum(axis=1)
This way of aggregating values over columns exists in a library like pandas as well.
pl.all() is for selecting all the columns and exclude() does the job of excluding specific columns. Another option is to specify the columns that you want to aggregate values over.
# aggregation over multiple columns
agg_col = (
df
.select(
pl.all().exclude('#', 'Name', 'Type 1', 'Type 2', 'Generation', 'Legendary', 'Total')
# pl.col(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']) # another option
)
.sum(axis=1)
.alias('agg over cols')
)
print(agg_col)
"""
shape: (163,)
Series: 'agg over cols' [i64]
[
318
405
525
625
309
405
534
634
634
314
405
530
…
]
"""
Just calculating aggregations over multiple columns is not too difficult. But adding that as a new column makes things a bit tricky in Polars or you need to understand how expressions work in Polars. In pandas, adding this type of aggregation is as easy as adding a new column like this:
df['new column'] = df.sum(axis=1)
But you won’t do this way of column assignment in Polars. You would use with_columns() or select().
Even though the new column is calculating the values at the dataframe level, you can pass that result/output (Polars series) into a method. This allows us to still add other columns as needed as well (I added a window function in the code for demonstration).
with_columns()
# add the calculation above as a new column
# option 1 - with_columns
df_1 = (
df
.with_columns(
agg_col,
pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
)
)
print(df_1.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur ┆ Grass ┆ 318 ┆ 318 ┆ 66.153846 │
│ Ivysaur ┆ Grass ┆ 405 ┆ 405 ┆ 66.153846 │
│ Venusaur ┆ Grass ┆ 525 ┆ 525 ┆ 66.153846 │
│ VenusaurMega Venusaur ┆ Grass ┆ 625 ┆ 625 ┆ 66.153846 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ Charizard ┆ Fire ┆ 534 ┆ 534 ┆ 65.857143 │
│ CharizardMega Charizard X ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ CharizardMega Charizard Y ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ Squirtle ┆ Water ┆ 314 ┆ 314 ┆ 66.967742 │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""
select()
# option 2 - select
df_2 = (
df
.select(
pl.col('*'),
agg_col,
pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
)
)
print(df_2.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur ┆ Grass ┆ 318 ┆ 318 ┆ 66.153846 │
│ Ivysaur ┆ Grass ┆ 405 ┆ 405 ┆ 66.153846 │
│ Venusaur ┆ Grass ┆ 525 ┆ 525 ┆ 66.153846 │
│ VenusaurMega Venusaur ┆ Grass ┆ 625 ┆ 625 ┆ 66.153846 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ Charizard ┆ Fire ┆ 534 ┆ 534 ┆ 65.857143 │
│ CharizardMega Charizard X ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ CharizardMega Charizard Y ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ Squirtle ┆ Water ┆ 314 ┆ 314 ┆ 66.967742 │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""
Using pl.sum_horizontal()
This is a clearer way of aggregating over columns than using df.sum(axis=1). This should be self-explanatory. You just specify columns you want to aggregate over. There is other variations like min, max, and even cumsom for horizontal aggregations.
df = (
df
.with_columns(
pl.sum_horizontal(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']).alias('agg over cols'),
pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
)
)
print(df.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur ┆ Grass ┆ 318 ┆ 318 ┆ 66.153846 │
│ Ivysaur ┆ Grass ┆ 405 ┆ 405 ┆ 66.153846 │
│ Venusaur ┆ Grass ┆ 525 ┆ 525 ┆ 66.153846 │
│ VenusaurMega Venusaur ┆ Grass ┆ 625 ┆ 625 ┆ 66.153846 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ Charizard ┆ Fire ┆ 534 ┆ 534 ┆ 65.857143 │
│ CharizardMega Charizard X ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ CharizardMega Charizard Y ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ Squirtle ┆ Water ┆ 314 ┆ 314 ┆ 66.967742 │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""
Using pl.fold()
pl.fold() can be used for more complicated horizontal aggregations. But in this one, we’re just calculating the sum.
There are 3 parameters, acc, function, and exprs. “acc” stands for accumulator expression, which will be initialized when the calculation starts. For example, if you put pl.lit((1) instead of pl.lit(0) in the code below, then the resulting values will be added by 1.
“function” is specifying what you’re applying to values in each column. In my code, it has “acc” and “x”. In order to understand what they mean, you can think of them as “series1” and “series2”. So you’re defining what operation you want to apply to the subsequent series/columns.
“exprs” is the expressions to specify columns you aggregate over. You can even add a condition like pl.col(*) > 120 to filter out some data.
df = (
df
.with_columns(
pl.fold(acc=pl.lit(0), function=lambda acc, x: acc + x, exprs=pl.col(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'])).alias('agg over cols'),
pl.col('HP').mean().over('Type 1').alias('Avg HP Per Type 1')
)
)
print(df.select(pl.col(['Name', 'Type 1', 'Total', 'agg over cols', 'Avg HP Per Type 1']).head()))
"""
shape: (10, 5)
┌───────────────────────────┬────────┬───────┬───────────────┬───────────────────┐
│ Name ┆ Type 1 ┆ Total ┆ agg over cols ┆ Avg HP Per Type 1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═══════════════════════════╪════════╪═══════╪═══════════════╪═══════════════════╡
│ Bulbasaur ┆ Grass ┆ 318 ┆ 318 ┆ 66.153846 │
│ Ivysaur ┆ Grass ┆ 405 ┆ 405 ┆ 66.153846 │
│ Venusaur ┆ Grass ┆ 525 ┆ 525 ┆ 66.153846 │
│ VenusaurMega Venusaur ┆ Grass ┆ 625 ┆ 625 ┆ 66.153846 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ Charizard ┆ Fire ┆ 534 ┆ 534 ┆ 65.857143 │
│ CharizardMega Charizard X ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ CharizardMega Charizard Y ┆ Fire ┆ 634 ┆ 634 ┆ 65.857143 │
│ Squirtle ┆ Water ┆ 314 ┆ 314 ┆ 66.967742 │
└───────────────────────────┴────────┴───────┴───────────────┴───────────────────┘
"""
Summary
Hope this post helps you learn how to aggregate over multiple columns in Polars. Polars expressions are powerful. Once you understand its basics, you’ll be able to take advantage of its flexibility.
References
- https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.sum.html
- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.all.html
- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.exclude.html
- https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.with_columns.html
- https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.select.html
- https://pola-rs.github.io/polars-book/user-guide/expressions/folds/
- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.fold.html
- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.sum_horizontal.html