Have you ever need to calculate the percent of total for your data? I bet you’ve encountered situations like that many times. It’s a common analysis, that if you know how to do, would benefit you as a data professional.
In this blog post, I’ll demonstrate how to do that in Polars. Here’s the link to the GitHub repo.
Calculate Percent of Total in Polars
In this example, my code calculates the percent of total count per particular grain. What you need is the count of rows for each value of a particular column as well as the total row count. And you divide the former by the latter. polars.count() comes in useful here as it gives you the row count in a specific context. There may be some other ways I have not found yet, but what I’ll demonstrate does the job.
Let’s say you have a dataframe like the following:
import polars as pl
df = pl.LazyFrame(
{
'Letter': ['A','A','A','B','B','C','D','D','D','D','D','E','F','G'],
'Value': [1,2,3,4,5,6,7,8,9,10,11,12,13,14]
}
)
One way to get the percent of total count, is to calculate the row count for each value of a column first, and then add a column calculating the total count. And finally do the division using those count columns.
df1 = (
df
.groupby('Letter')
.agg(
pl.count().alias('Cnt Per Letter')
)
.with_columns(
(
(pl.col('Cnt Per Letter') / pl.col('Cnt Per Letter').sum()).round(2) * 100
)
.cast(pl.UInt64).alias('Percent of Total')
)
.sort('Letter')
)
print(df1.collect())
'''
output:
shape: (7, 3)
┌────────┬────────────────┬──────────────────┐
│ Letter ┆ Cnt Per Letter ┆ Percent of Total │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u64 │
╞════════╪════════════════╪══════════════════╡
│ A ┆ 3 ┆ 21 │
│ B ┆ 2 ┆ 14 │
│ C ┆ 1 ┆ 7 │
│ D ┆ 5 ┆ 36 │
│ E ┆ 1 ┆ 7 │
│ F ┆ 1 ┆ 7 │
│ G ┆ 1 ┆ 7 │
└────────┴────────────────┴──────────────────┘
'''
Another way I came up with is to use a window function to do the same.
df2 = (
df
.with_columns(
pl.col('Value').count().over('Letter').alias('Cnt Per Letter'),
pl.count().alias('Total Cnt')
)
.with_columns(
(
(pl.col('Cnt Per Letter') / pl.col('Total Cnt')).round(2) * 100
)
.cast(pl.UInt64).alias('Percent of Total')
)
.unique('Letter')
.select(['Letter', 'Cnt Per Letter', 'Percent of Total'])
.sort('Letter')
)
print(df2.collect())
'''
output:
shape: (7, 3)
┌────────┬────────────────┬──────────────────┐
│ Letter ┆ Cnt Per Letter ┆ Percent of Total │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u64 │
╞════════╪════════════════╪══════════════════╡
│ A ┆ 3 ┆ 21 │
│ B ┆ 2 ┆ 14 │
│ C ┆ 1 ┆ 7 │
│ D ┆ 5 ┆ 36 │
│ E ┆ 1 ┆ 7 │
│ F ┆ 1 ┆ 7 │
│ G ┆ 1 ┆ 7 │
└────────┴────────────────┴──────────────────┘
'''
Summary
There are ways to calculate percent of total in Polars as well as in other tools. I hope this post gives an idea of how you can do that and apply it in other analytics use cases using Polars.
References
- https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.count.html
- https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/groupby.html
- https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.with_columns.html
- https://stackoverflow.com/questions/75080918/possible-to-calculate-counts-and-percentage-in-one-chain-using-polars
- https://stackoverflow.com/questions/72747398/best-way-to-get-percentage-counts-in-polars