Menu
Stuff by Yuki
  • Home
  • Python
  • Power BI
  • Tableau
  • Community
    • Makeover Monday
    • Workout Wednesday
  • About
  • Contact
Stuff by Yuki

Polars with DuckDB – Using SQL in Polars

Posted on February 13, 2023February 20, 2023

Polars and DuckDB

We see that Polars and DuckDB seem to be the same kind of tools, where we can use them for data analytics, data science, and data engineering tasks, with great performance. I’m not going into which one is better than the other, but I’d like to show you a good way to use DuckDB to compliment Polars.

How to Use SQL in Polars

When it comes to complex data wrangling tasks, I’m personally more conformable doing them in SQL because I’ve used it more than anything else in production environment. Also, at many companies, these tasks are usually done in tools such as dbt and other ELT/ETL tools that utilize SQL. Though there are cases I see implementing transformation logic in Python or other tools is easier than in SQL. But it all depends on the context of your project.

Why use SQL in Polars? I really don’t have to, but I was playing with Polars the other day, I thought, “It would be great if I can test the output in SQL to double check the validity of my calculations I did in Polars”, given that I’m fairly new to Polars.

In fact, there is a method read_sql() that you can read from a database in Polars, but to apply further data transformations, you’d need to use whatever is available in Polars. You can actually run SQL syntax in Polars. Luca shared a tip on his LinkedIn post the other way that I didn’t know existed!

Use SQL in Polars

I researched ways to use SQL in Polars and I found that you can use DuckDB to use SQL and convert back and force between Polars dataframe and Apache Arrow format, with which you can write a query in DuckDB. It’s already introduced in this DuckDB documentation, but I’ll reiterate how it can be done in this post.

If you’re following along, you can get the data and source code from this github repo. Please make sure you install duckdb and polars Python libraries before you move on.

First you need to read in your data. Check out another blog post on how to read csv files in Polars if you’d like. After doing so, you need to convert it to Apache Arrow format. This is done by using .to_arrow() method available in Polars.

Copy Copied Use a different Browser

import polars as pl
import duckdb

# read in data
df_pl = pl.scan_csv('../sample_data.csv').collect()
df_arrow = df_pl.to_arrow()
print(df_arrow)
'''
pyarrow.Table
studentId: int64
Name: large_string
Age: int64
FirstEnrolledDate: large_string
----
studentId: [[1,2,3]]
Name: [["Mike","Sarah","John"]]
Age: [[24,33,19]]
FirstEnrolledDate: [["2020-01-17","2021-07-23","2022-12-20"]]
'''

As you can see, the output of arrow formatted data is not as readable as a dataframe.

From here, I’ll apply some transformations in both Polars and DuckDB to see how that can be done.

Firstly, let’s do a simple transformation where we just filter the dataframe.

Copy Copied Use a different Browser

# filtering in polars 
df_pl = (
    df_pl
    .filter(pl.col('Name')=='Mike')
)
print(df_pl)
'''
shape: (1, 4)
┌───────────┬──────┬─────┬───────────────────┐
│ studentId ┆ Name ┆ Age ┆ FirstEnrolledDate │
│ ---       ┆ ---  ┆ --- ┆ ---               │
│ i64       ┆ str  ┆ i64 ┆ str               │
╞═══════════╪══════╪═════╪═══════════════════╡
│ 1         ┆ Mike ┆ 24  ┆ 2020-01-17        │
└───────────┴──────┴─────┴───────────────────┘
'''

Copy Copied Use a different Browser

# filtering in duckdb
df_duckdb = duckdb.query("""
  select 
    *
  from df_arrow
  where Name = 'Mike'
  
""")
print(df_duckdb)
'''
---------------------
--- Relation Tree ---
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- studentId (BIGINT)
- Name (VARCHAR)
- Age (BIGINT)
- FirstEnrolledDate (VARCHAR)

---------------------
-- Result Preview  --
---------------------
studentId       Name    Age     FirstEnrolledDate
BIGINT  VARCHAR BIGINT  VARCHAR
[ Rows: 1]
1       Mike    24      2020-01-17
'''

In using DuckDB, you specify the name of your dataframe in your SQL query directly.

One thing I noticed is that DuckDB’s output has too verbose in my opinion, but we can see the result in a table format at least.

Now, let’s apply a simple window function both in Polars and DuckDB.

RANK() in Polars – Window Function

There is an existing rank() method in Polars. We add a new column that ranks each row based on enrolled date.

Copy Copied Use a different Browser

# a window function in polars
df_pl_rank = (
    df_pl
    .with_columns([
        pl.col('FirstEnrolledDate').rank().cast(pl.Int64).alias('EnrolledRank')
    ])
)
print(df_pl_rank)
'''
shape: (3, 5)
┌───────────┬───────┬─────┬───────────────────┬──────────────┐
│ studentId ┆ Name  ┆ Age ┆ FirstEnrolledDate ┆ EnrolledRank │
│ ---       ┆ ---   ┆ --- ┆ ---               ┆ ---          │
│ i64       ┆ str   ┆ i64 ┆ str               ┆ i64          │
╞═══════════╪═══════╪═════╪═══════════════════╪══════════════╡
│ 1         ┆ Mike  ┆ 24  ┆ 2020-01-17        ┆ 1            │
│ 2         ┆ Sarah ┆ 33  ┆ 2021-07-23        ┆ 2            │
│ 3         ┆ John  ┆ 19  ┆ 2022-12-20        ┆ 3            │
└───────────┴───────┴─────┴───────────────────┴──────────────┘
'''

Copy Copied Use a different Browser

# a window function in duckdb
df_duckdb_rank = duckdb.query("""
  select 
    *,
    rank() over(order by FirstEnrolledDate) as EnrolledRank

  from df_arrow
  
""")
print(df_duckdb_rank)
'''
---------------------
--- Relation Tree ---
---------------------
Subquery

---------------------
-- Result Columns  --
---------------------
- studentId (BIGINT)
- Name (VARCHAR)
- Age (BIGINT)
- FirstEnrolledDate (VARCHAR)
- EnrolledRank (BIGINT)

---------------------
-- Result Preview  --
---------------------
studentId       Name    Age     FirstEnrolledDate       EnrolledRank
BIGINT  VARCHAR BIGINT  VARCHAR BIGINT
[ Rows: 3]
1       Mike    24      2020-01-17      1
2       Sarah   33      2021-07-23      2
3       John    19      2022-12-20      3
'''

Convert DuckDB to Polars DataFrame

In order to convert DuckDB query result to Polars’ dataframe, you’d need to convert the query to arrow format and to Polars dataframe.

You essentially use .arrow() and wrap it around pl.DataFrame() to convert the arrow format to Polars dataframe.

Copy Copied Use a different Browser

# convert duckdb to polars dataframe
duckdb_to_polars = pl.DataFrame(df_duckdb_rank.arrow())
print(duckdb_to_polars)
'''
shape: (3, 5)
┌───────────┬───────┬─────┬───────────────────┬──────────────┐
│ studentId ┆ Name  ┆ Age ┆ FirstEnrolledDate ┆ EnrolledRank │
│ ---       ┆ ---   ┆ --- ┆ ---               ┆ ---          │
│ i64       ┆ str   ┆ i64 ┆ str               ┆ i64          │
╞═══════════╪═══════╪═════╪═══════════════════╪══════════════╡
│ 1         ┆ Mike  ┆ 24  ┆ 2020-01-17        ┆ 1            │
│ 2         ┆ Sarah ┆ 33  ┆ 2021-07-23        ┆ 2            │
│ 3         ┆ John  ┆ 19  ┆ 2022-12-20        ┆ 3            │
└───────────┴───────┴─────┴───────────────────┴──────────────┘
'''

Conclusion

I hope you can see how DuckDB and Polars work together. It’s fairly easy to go back and forth between the two. Both of these technologies are useful in themselves, and they become even more powerful when you know how to use them together.

Source code: Github repo

2 thoughts on “Polars with DuckDB – Using SQL in Polars”

  1. John Edward says:
    February 14, 2023 at 6:42 am

    Dear Yuki,

    Thank you for your article.
    If you don’t mind, I would like to ask:
    Did you compare the performance of duckDB vs. Polar?
    Or in which use cases you would prefer one over the other?
    In your other article, e.g., you mentioned about Polars’ lazy API.

    You may also know this blog article:
    https://dataengineeringcentral.substack.com/p/whats-all-the-hype-with-duckdb

    In the comments section there, I made some references as well.

    By the way, on the same day you issued this article, duckDB announced a new release with some new features:
    https://duckdb.org/2023/02/13/announcing-duckdb-070.html

    Have a great day!

    P.S. I am not affiliated with duckDB nor with Polars. Just a curious user.

    Reply
    1. Yuki says:
      February 20, 2023 at 11:13 pm

      Hi John,

      Thanks for your comment! I haven’t compared the performance between polars and duckdb yet, but it’s up next. I just published a new article comparing polars vs pandas.

      I haven’t really explored duckdb enough to say something about it, but I’d say I prefer whichever performs better. But I quite like polars syntax. I’ll see what happens once I get familiar with duckdb 🙂

      Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • What are Power BI Dataflows?
  • Calculate the Max Sales Amount for Product
  • Convert DataFrame to Series in Polars
  • Pandas vs Polars – Speed Comparison
  • Polars with DuckDB – Using SQL in Polars

Popular Posts

  • A Running Total Calculation with Quick Measure in Power BI
  • How To Copy And Paste Report Page in Power BI
  • Year-Over-Year Calculation: Time Intelligence in Power BI
  • How to Fill Dates Between Start Date and End Date in Power BI (Power Query)
  • Network Visualizations in Python

connect with me

  • LinkedIn
  • Twitter
  • Github
©2023 Stuff by Yuki | Powered by SuperbThemes & WordPress