Stuff by Yuki

Read from and Write to Amazon S3 in Polars

Posted on June 17, 2023
Image by Sixteen Miles Out on Unsplash

How do you work with Amazon S3 in Polars? Amazon S3 is one of the most common object stores for data projects. Because Polars is a fairly new technology, there are not many resources that explain how to work with S3.

In this post, I’ll walk you through reading from and writing to an S3 bucket in Polars, specifically CSV and Parquet files.

You can see my full code in my GitHub repo.

Read from S3 in Polars

Let’s say you have a file containing data like this in an S3 bucket (the example data comes from a book called “Data Pipelines Pocket Reference”):

order_id  status       datetime
1         Backordered  2020-06-01 12:00:00
1         Shipped      2020-06-09 12:00:25
2         Shipped      2020-07-11 03:05:00
1         Shipped      2020-06-09 11:50:00
3         Shipped      2020-07-12 12:00:00

I’ve found two ways to read from S3 in Polars. One is the approach introduced in the Polars documentation. The other is to treat S3 as a file system and read from it just as you would read a local file with code like “with open()…”.

You need two other libraries for the first approach: s3fs and pyarrow. The idea is to read the file in S3 through s3fs as a pyarrow dataset and then convert it to a Polars dataframe. (Make sure s3fs has the configuration it needs to work, such as an AWS profile set up with the right IAM permissions.)

You can use the code from the Polars documentation as is, which uses pl.from_arrow(). I modified it slightly so that it gets the data as a LazyFrame by using pl.scan_pyarrow_dataset(). I also use pyarrow.dataset.dataset() instead of pyarrow.parquet.ParquetDataset() so that I can specify the file format.

Parquet


import polars as pl
import pyarrow.dataset as ds
import s3fs
from config import BUCKET_NAME

# set up 
fs = s3fs.S3FileSystem(profile='s3_full_access')

# read parquet
dataset = ds.dataset(f"s3://{BUCKET_NAME}/order_extract.parquet", filesystem=fs, format='parquet')
df_parquet = pl.scan_pyarrow_dataset(dataset)
print(df_parquet.collect().head())

To read a CSV file, change format='parquet' to format='csv' and point the path to the CSV file.

The second way is simpler: you open the file in binary mode through the file system and pass the file object to Polars.


import polars as pl
import s3fs
from config import BUCKET_NAME

# set up 
fs = s3fs.S3FileSystem(profile='s3_full_access')

# read parquet 2 
with fs.open(f'{BUCKET_NAME}/order_extract.parquet', mode='rb') as f:
    print(pl.read_parquet(f).head())

Write to S3 in Polars

To write to S3, you take the second approach explained above, so the only extra dependency you need is the s3fs library.


import polars as pl
import s3fs
from config import BUCKET_NAME

# prep df
df = pl.DataFrame({
    'ID': [1,2,3,4],
    'Direction': ['up', 'down', 'right', 'left']
})

# set up 
fs = s3fs.S3FileSystem(profile='s3_full_access')

# write parquet
with fs.open(f'{BUCKET_NAME}/direction.parquet', mode='wb') as f:
    df.write_parquet(f)

Summary

I hope this post gives you an idea of how you can work with files in an S3 bucket. Please reach out if you know better or more efficient ways to read from and write to S3 in Polars!

GitHub repo

References

  • https://pola-rs.github.io/polars-book/user-guide/io/aws/
  • https://medium.com/@louis_10840/how-to-process-data-stored-in-amazon-s3-using-polars-2305bf064c52
  • https://stackoverflow.com/questions/75115246/with-python-is-there-a-way-to-load-a-polars-dataframe-directly-into-an-s3-bucke
