One of the features in DuckDB is its integration with other data libraries such as pandas. DuckDB makes it seamless when we convert to and from other dataframes and table formats. This flexibility gives the users the ability to implement DuckDB in their data pipelines with ease.
In this post, I’ll walk you through how to work with pandas, polars, and pyarrow in DuckDB.
You can find the full code in my GitHub repo.
DuckDB with Polars
Execute SQL on Polars in DuckDB – Polars to DuckDB
You can simply run a sql query specifying the dataframe name.
import polars as pl
import duckdb
data = {'ID': [1,2,3,4,5], 'Name': ['Microsoft', 'Apple', 'Netflix', 'Spotify', 'Intel']}
# duckdb on polars dataframe
pl_df = pl.DataFrame(data)
rel = duckdb.sql('select * from pl_df')
print('\nDuckDB relation from Polars df: \n', rel, type(rel))
"""
DuckDB relation from Polars df:
┌───────┬───────────┐
│ ID │ Name │
│ int64 │ varchar │
├───────┼───────────┤
│ 1 │ Microsoft │
│ 2 │ Apple │
│ 3 │ Netflix │
│ 4 │ Spotify │
│ 5 │ Intel │
└───────┴───────────┘
class 'duckdb.DuckDBPyRelation'
"""
DuckDB to Polars
To convert from DuckDB relation object to Polars dataframe, you’d use .pl().
# duckdb to polars
pl_df_from_duckdb = rel.pl()
print('\nPolars df from DuckDB: \n', type(pl_df_from_duckdb))
"""
Polars df from DuckDB:
class 'polars.internals.dataframe.frame.DataFrame'
"""
DuckDB with Pandas
Execute SQL on Pandas in DuckDB – Pandas to DuckDB
import pandas as pd
import duckdb
data = {'ID': [1,2,3,4,5], 'Name': ['Microsoft', 'Apple', 'Netflix', 'Spotify', 'Intel']}
# duckdb on pandas dataframe - pandas to duckdb
df = pd.DataFrame(data)
rel = duckdb.sql('select * from df')
print('\nDuckDB relation from Pandas df: \n', rel, type(rel))
"""
DuckDB relation from Pandas df:
┌───────┬───────────┐
│ ID │ Name │
│ int64 │ varchar │
├───────┼───────────┤
│ 1 │ Microsoft │
│ 2 │ Apple │
│ 3 │ Netflix │
│ 4 │ Spotify │
│ 5 │ Intel │
└───────┴───────────┘
class 'duckdb.DuckDBPyRelation'
"""
DuckDB to Pandas
For pandas, you use .df().
# duckdb to polars
df_from_duckdb = rel.df()
print('\nPandas df from DuckDB: \n', type(df_from_duckdb))
"""
Pandas df from DuckDB:
class 'pandas.core.frame.DataFrame'
"""
DuckDB with Arrow
Execute SQL on Arrow in DuckDB – Arrow to DuckDB
import pyarrow as pa
import duckdb
data = {'ID': [1,2,3,4,5], 'Name': ['Microsoft', 'Apple', 'Netflix', 'Spotify', 'Intel']}
# duckdb on arrow table - arrow to duckdb
arrow = pa.Table.from_pydict(data)
rel = duckdb.sql('select * from arrow')
print('\nDuckDB relation from Arrow table: \n', rel, type(rel))
"""
DuckDB relation from Arrow table:
┌───────┬───────────┐
│ ID │ Name │
│ int64 │ varchar │
├───────┼───────────┤
│ 1 │ Microsoft │
│ 2 │ Apple │
│ 3 │ Netflix │
│ 4 │ Spotify │
│ 5 │ Intel │
└───────┴───────────┘
class 'duckdb.DuckDBPyRelation'
"""
DuckDB to Arrow
For pyarrow, you use .arrow().
# duckdb to arrow
arrow_from_duckdb = rel.arrow()
print('\nArrow table from DuckDB: \n', type(arrow_from_duckdb))
"""
Arrow table from DuckDB:
class 'pyarrow.lib.Table'
"""
Summary
As you just saw, it is super easy to use DuckDB in conjunction with pandas, polars, and pyarrow. Hope this post helps you get started in using DuckDB with other data libraries!