Polars and DuckDB
Polars and DuckDB appear to be the same kind of tool: both can be used for data analytics, data science, and data engineering tasks with great performance. I'm not going to argue which one is better than the other, but I'd like to show you a good way to use DuckDB to complement Polars.
How to Use SQL in Polars
When it comes to complex data wrangling tasks, I'm personally more comfortable doing them in SQL because I've used it more than anything else in production environments. Also, at many companies, these tasks are usually done in tools such as dbt and other ELT/ETL tools that utilize SQL. That said, there are cases where implementing transformation logic in Python or other tools is easier than in SQL; it all depends on the context of your project.
Why use SQL in Polars? I don't strictly have to, but while playing with Polars the other day, I thought, "It would be great if I could use SQL to double-check the validity of the calculations I did in Polars," given that I'm fairly new to Polars.
In fact, there is a read_sql() function that lets you read from a database into Polars, but to apply further data transformations, you'd need to use whatever is available in Polars. You can actually run SQL syntax in Polars itself: Luca shared a tip in a LinkedIn post the other day that I didn't know existed! I wrote about how to do that in this blog post.
Use SQL with DuckDB in Polars
I researched ways to use SQL in Polars and found that you can use DuckDB to run SQL by converting back and forth between a Polars dataframe and the Apache Arrow format, which DuckDB can query directly. This is already introduced in the DuckDB documentation, but I'll reiterate how it can be done in this post. EDIT: That DuckDB documentation has been updated and now explains how to use SQL with DuckDB directly on a Polars dataframe. Go to the section called "Use SQL with DuckDB Directly on Polars DataFrame" towards the end of this post.
If you're following along, you can get the data and source code from this GitHub repo. Please make sure you install the duckdb and polars Python libraries before you move on.
First, you need to read in your data. Check out another blog post on how to read CSV files in Polars if you'd like. After doing so, you need to convert it to Apache Arrow format, which is done with the .to_arrow() method available in Polars.
import polars as pl
import duckdb
# read in data
df_pl = pl.scan_csv('../sample_data.csv').collect()
df_arrow = df_pl.to_arrow()
print(df_arrow)
'''
pyarrow.Table
studentId: int64
Name: large_string
Age: int64
FirstEnrolledDate: large_string
----
studentId: [[1,2,3]]
Name: [["Mike","Sarah","John"]]
Age: [[24,33,19]]
FirstEnrolledDate: [["2020-01-17","2021-07-23","2022-12-20"]]
'''
As you can see, the Arrow-formatted output is not as readable as a dataframe.
From here, I’ll apply some transformations in both Polars and DuckDB to see how that can be done.
Firstly, let’s do a simple transformation where we just filter the dataframe.
# filtering in polars
df_pl = (
df_pl
.filter(pl.col('Name')=='Mike')
)
print(df_pl)
'''
shape: (1, 4)
┌───────────┬──────┬─────┬───────────────────┐
│ studentId ┆ Name ┆ Age ┆ FirstEnrolledDate │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═══════════╪══════╪═════╪═══════════════════╡
│ 1 ┆ Mike ┆ 24 ┆ 2020-01-17 │
└───────────┴──────┴─────┴───────────────────┘
'''
# filtering in duckdb
df_duckdb = duckdb.query("""
select
*
from df_arrow
where Name = 'Mike'
""")
print(df_duckdb)
'''
---------------------
--- Relation Tree ---
---------------------
Subquery
---------------------
-- Result Columns --
---------------------
- studentId (BIGINT)
- Name (VARCHAR)
- Age (BIGINT)
- FirstEnrolledDate (VARCHAR)
---------------------
-- Result Preview --
---------------------
studentId Name Age FirstEnrolledDate
BIGINT VARCHAR BIGINT VARCHAR
[ Rows: 1]
1 Mike 24 2020-01-17
'''
When using DuckDB, you reference the name of your dataframe directly in your SQL query.
One thing I noticed is that DuckDB's output is too verbose in my opinion, but at least we can see the result in a table format.
Now, let’s apply a simple window function both in Polars and DuckDB.
RANK() in Polars – Window Function
There is an existing rank() method in Polars. We'll add a new column that ranks each row by its enrolled date.
# a window function in polars
df_pl_rank = (
df_pl
.with_columns([
pl.col('FirstEnrolledDate').rank().cast(pl.Int64).alias('EnrolledRank')
])
)
print(df_pl_rank)
'''
shape: (3, 5)
┌───────────┬───────┬─────┬───────────────────┬──────────────┐
│ studentId ┆ Name ┆ Age ┆ FirstEnrolledDate ┆ EnrolledRank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪═════╪═══════════════════╪══════════════╡
│ 1 ┆ Mike ┆ 24 ┆ 2020-01-17 ┆ 1 │
│ 2 ┆ Sarah ┆ 33 ┆ 2021-07-23 ┆ 2 │
│ 3 ┆ John ┆ 19 ┆ 2022-12-20 ┆ 3 │
└───────────┴───────┴─────┴───────────────────┴──────────────┘
'''
# a window function in duckdb
df_duckdb_rank = duckdb.query("""
select
*,
rank() over(order by FirstEnrolledDate) as EnrolledRank
from df_arrow
""")
print(df_duckdb_rank)
'''
---------------------
--- Relation Tree ---
---------------------
Subquery
---------------------
-- Result Columns --
---------------------
- studentId (BIGINT)
- Name (VARCHAR)
- Age (BIGINT)
- FirstEnrolledDate (VARCHAR)
- EnrolledRank (BIGINT)
---------------------
-- Result Preview --
---------------------
studentId Name Age FirstEnrolledDate EnrolledRank
BIGINT VARCHAR BIGINT VARCHAR BIGINT
[ Rows: 3]
1 Mike 24 2020-01-17 1
2 Sarah 33 2021-07-23 2
3 John 19 2022-12-20 3
'''
Convert DuckDB to Polars DataFrame
In order to convert a DuckDB query result to a Polars dataframe, you first convert the result to Arrow format and then to a Polars dataframe.
You essentially call .arrow() on the result and wrap it in pl.DataFrame() to convert the Arrow format to a Polars dataframe.
# convert duckdb to polars dataframe
duckdb_to_polars = pl.DataFrame(df_duckdb_rank.arrow())
print(duckdb_to_polars)
'''
shape: (3, 5)
┌───────────┬───────┬─────┬───────────────────┬──────────────┐
│ studentId ┆ Name ┆ Age ┆ FirstEnrolledDate ┆ EnrolledRank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str ┆ i64 │
╞═══════════╪═══════╪═════╪═══════════════════╪══════════════╡
│ 1 ┆ Mike ┆ 24 ┆ 2020-01-17 ┆ 1 │
│ 2 ┆ Sarah ┆ 33 ┆ 2021-07-23 ┆ 2 │
│ 3 ┆ John ┆ 19 ┆ 2022-12-20 ┆ 3 │
└───────────┴───────┴─────┴───────────────────┴──────────────┘
'''
Use SQL with DuckDB Directly on Polars DataFrame
Yes, you read it right. You can use DuckDB directly on a Polars dataframe. It wasn't possible when I first wrote this post, but now you can.
Here’s the example code:
import polars as pl
import duckdb
# read in data
df_pl = pl.scan_csv('../sample_data.csv')
# filtering in duckdb
df_duckdb_filtered = duckdb.query("""
select
*
from df_pl
where Name = 'Mike'
""")
print(df_duckdb_filtered)
# a window function in duckdb
df_duckdb_rank = duckdb.query("""
select
*,
rank() over(order by FirstEnrolledDate) as EnrolledRank
from df_pl
""")
print(df_duckdb_rank)
# convert duckdb to polars dataframe
duckdb_to_polars = df_duckdb_rank.pl()
print(duckdb_to_polars)
What you need to do is basically:
- reference the dataframe name in your SQL query
- use .pl() to convert the DuckDB result to a Polars dataframe
Conclusion
I hope you can see how DuckDB and Polars work together. It’s fairly easy to go back and forth between the two. Both of these technologies are useful in themselves, and they become even more powerful when you know how to use them together.
Source code: Github repo
Dear Yuki,
Thank you for your article.
If you don’t mind, I would like to ask:
Did you compare the performance of DuckDB vs. Polars?
Or in which use cases would you prefer one over the other?
In your other article, for example, you mentioned Polars' lazy API.
You may also know this blog article:
https://dataengineeringcentral.substack.com/p/whats-all-the-hype-with-duckdb
In the comments section there, I made some references as well.
By the way, on the same day you issued this article, duckDB announced a new release with some new features:
https://duckdb.org/2023/02/13/announcing-duckdb-070.html
Have a great day!
P.S. I am not affiliated with duckDB nor with Polars. Just a curious user.
Hi John,
Thanks for your comment! I haven’t compared the performance between polars and duckdb yet, but it’s up next. I just published a new article comparing polars vs pandas.
I haven’t really explored duckdb enough to say something about it, but I’d say I prefer whichever performs better. But I quite like polars syntax. I’ll see what happens once I get familiar with duckdb 🙂