Checking and filling missing values is an important part of data science and analytics projects. pandas, a popular dataframe library, provides methods such as fillna() for this. Polars has built-in methods and expressions for working with missing values as well. This post covers ways to check for missing values and ways to fill them.
All of my code is available in my GitHub repo.
Null vs NaN
It’s worth going over the difference between “Null” values and “NaN” values. First off, “NaN” stands for “Not a Number”. It represents an invalid floating-point value in a column of float data type, and it’s technically not considered a missing value in Polars. If you’re familiar with pandas or numpy, np.nan will be treated as NaN in Polars.
On the other hand, “Null” values are the general missing values you usually think of. A null is simply a pure missing value, regardless of the column’s data type, and null values and NaN values can exist in the same column. As a side note, Python’s None values are treated as nulls in Polars.
One important thing to remember is that “Null” and “NaN” values are treated differently in Polars. For instance, filling null values with fill_null() doesn’t fill NaN values, and vice versa.
Check Missing Values in Polars
Let’s say we have a dataframe that looks like this:
import polars as pl
df = pl.DataFrame({
'Letter': ['A', 'B', 'C', 'D', 'F', None, 'H', 'I', 'J'],
'Value': [1.0, 2, 3, None, 5, 6, 7, None, float("nan")]
})
"""
shape: (9, 2)
┌────────┬───────┐
│ Letter ┆ Value │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪═══════╡
│ A ┆ 1.0 │
│ B ┆ 2.0 │
│ C ┆ 3.0 │
│ D ┆ null │
│ F ┆ 5.0 │
│ null ┆ 6.0 │
│ H ┆ 7.0 │
│ I ┆ null │
│ J ┆ NaN │
└────────┴───────┘
"""
It contains null values (from Python’s None) as well as a NaN value.
One of the things you can use to check missing values is .null_count(). It’s available on both Polars dataframes and series.
print(df.null_count()) # null_count on the whole dataframe
print(df.select(pl.col('Value').null_count())) # null_count on a column
"""
shape: (1, 2)
┌────────┬───────┐
│ Letter ┆ Value │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞════════╪═══════╡
│ 1 ┆ 2 │
└────────┴───────┘
shape: (1, 1)
┌───────┐
│ Value │
│ --- │
│ u32 │
╞═══════╡
│ 2 │
└───────┘
"""
As you can see, .null_count() only counts null values, ignoring NaN values.
As of today, there’s no .null_count() equivalent for NaN values, but there is a counterpart to .is_null(), which is .is_nan(). You can use it to get the count of NaN values with code like the following:
print(
df
.filter(pl.col('Value').is_nan())
.select(pl.count())
)
"""
shape: (1, 1)
┌───────┐
│ count │
│ --- │
│ u32 │
╞═══════╡
│ 1     │
└───────┘
"""
print(
df
.filter(pl.col('Value').is_nan())
.shape[0]
)
"""
1
"""
On a side note, you can also use .is_not_null() and .is_not_nan() to get the exact opposite results of .is_null() and .is_nan().
Fill Missing Values in Polars
For filling missing values, there are methods for both Null values and NaN values: .fill_null() and .fill_nan(). You can use them at the dataframe level as well as the column level. I will demonstrate .fill_null() only, as .fill_nan() works the same way.
There are 3 ways you can fill missing values:
- Using literal values
- Using a strategy
- Using Expression(s)
Using Literal Value
It pretty much explains itself: you specify a fixed value, and Null values are replaced with that value.
print(
df.select(pl.col('Value').fill_null(100))
)
"""
shape: (9, 1)
┌───────┐
│ Value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ 2.0 │
│ 3.0 │
│ 100.0 │
│ 5.0 │
│ 6.0 │
│ 7.0 │
│ 100.0 │
│ NaN │
└───────┘
"""
Using a Strategy
If you’ve used pandas to fill missing values, you might be familiar with strategies such as backward fill and forward fill. Similar options are available in Polars as well. The current list is: ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’.
# with forward strategy
print(
df.select(pl.col('Value').fill_null(strategy='forward'))
)
"""
shape: (9, 1)
┌───────┐
│ Value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ 2.0 │
│ 3.0 │
│ 3.0 │
│ 5.0 │
│ 6.0 │
│ 7.0 │
│ 7.0 │
│ NaN │
└───────┘
"""
# with backward strategy
print(
df.select(pl.col('Value').fill_null(strategy='backward'))
)
"""
shape: (9, 1)
┌───────┐
│ Value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ 2.0 │
│ 3.0 │
│ 5.0 │
│ 5.0 │
│ 6.0 │
│ 7.0 │
│ NaN │
│ NaN │
└───────┘
"""
Using Expressions
This is where Polars shines: you can use your own expression to fill missing values, so you aren’t limited to the list of fill strategies. In this example, I’m taking the difference between the max and min of the ‘Value’ column and multiplying it by 10. You probably won’t do this in the real world, but it demonstrates a way to fill missing values using expressions.
# with expressions
print(
df.select(pl.col('Value').fill_null(
(pl.col('Value').max() - pl.col('Value').min()) * 10
))
)
"""
shape: (9, 1)
┌───────┐
│ Value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ 2.0 │
│ 3.0 │
│ 60.0 │
│ 5.0 │
│ 6.0 │
│ 7.0 │
│ 60.0 │
│ NaN │
└───────┘
"""
Drop Missing Values
Besides filling missing values, one other thing you can do is to drop rows that contain missing values.
As of today, Polars has this method only for Null values: .drop_nulls(), which is available for both dataframes and lazyframes.
If you want to drop rows that contain NaN values, you need to fill NaN values with Null values first and then apply .drop_nulls().
# drop nulls values
print(df.drop_nulls())
"""
shape: (6, 2)
┌────────┬───────┐
│ Letter ┆ Value │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪═══════╡
│ A ┆ 1.0 │
│ B ┆ 2.0 │
│ C ┆ 3.0 │
│ F ┆ 5.0 │
│ H ┆ 7.0 │
│ J ┆ NaN │
└────────┴───────┘
"""
# drop nan values, after converting them to nulls
print(df.fill_nan(None).drop_nulls())
"""
shape: (5, 2)
┌────────┬───────┐
│ Letter ┆ Value │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪═══════╡
│ A ┆ 1.0 │
│ B ┆ 2.0 │
│ C ┆ 3.0 │
│ F ┆ 5.0 │
│ H ┆ 7.0 │
└────────┴───────┘
"""
Summary
I hope this blog post helps you learn how to check, fill, and drop missing values in Polars. Knowing how to handle missing values is a valuable skill for any data analytics/science professional.