Polars has been steadily gaining popularity since its initial release. You may have already used it, or you may be considering giving it a try.
This blog post will cover what Polars Cookbook is, who it is for as well as what Polars is in the first place, and why you might benefit from using it in your data projects.
You can preorder Polars Cookbook on Amazon or the Packt website. The book will be published on August 23, 2024. Here’s the GitHub repo that contains all the code used in the book. It is under the MIT license.
What is Polars?
Polars is an open-source data processing library built in Rust. Polars uses Apache Arrow Columnar format as the memory model. It’s available in several programming languages such as Rust, Python, Node.js, and R. The typical use case is to use Polars in Python to replace pandas or PySpark for more efficient data processing. If you’re a pandas user, think of Polars as its successor.
Polars has many unique features including (referenced from the Polars GitHub page):
- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger-than-RAM datasets)
To learn more about Polars and its key features, I recommend you visit the Polars user guide.
One key thing to remember is that Polars is an open-source tool that runs on a single machine, just like pandas; it is not designed for the distributed workloads you would use Spark or Dask for. However, the Polars company is working on a distributed computing solution for Polars, Polars Cloud, which is exciting!
Why Should You Use Polars?
Here are four reasons why you should use Polars.
Firstly, Polars is blazingly fast. It’s designed to use all the available cores and memory on your machine efficiently. It also lets you run your transformation logic in parallel where possible.
Secondly, Polars has a clean, expression-centric API. It’s the most expressive API I’ve ever used for a dataframe library. Once you learn the syntax, it’s a breeze to implement any transformation logic in a declarative manner. This is one of the reasons why people build tools like Narwhals leveraging the power of Polars API.
Thirdly, Polars allows you to use both lazy and eager modes. Eager mode, or eager evaluation, is when operations are executed immediately as they’re called, so there is no room for query optimization to play a role. Lazy mode, on the other hand, is where operations are NOT executed until the results are needed. This allows Polars’ engine to look at the execution plan as a whole and apply query optimizations.
Finally, Polars allows you to process datasets larger than the available RAM on your machine. To put it differently, Polars can handle out-of-core or larger-than-RAM processing.
And the list doesn’t stop here. There are other exciting features such as being able to build your own plugins as well as its interoperability with other libraries.
An Example Use Case of Python Polars
I had the opportunity to create data processing pipelines in Python Polars for one of my consulting clients last year to benchmark the performance against pandas and DuckDB. The initial benchmark showed that Polars performed roughly 30x faster than both pandas and DuckDB. They then onboarded a SQL veteran to optimize the DuckDB queries, which significantly improved DuckDB’s performance, but Polars was still more performant than DuckDB by about 15%.
The degree to which you can improve the performance of your data processing pipeline varies depending on the nature and complexity of your code, but I hope you can see how impactful it may be to implement your data processing pipelines in Python Polars.
What is Polars Cookbook?
Polars Cookbook is a book that contains practical recipes for data transformation, manipulation, and analysis in Python Polars. It’s structured so that you start with the fundamentals of Python Polars and move on to more advanced topics in later chapters.
What are the Topics Covered in the Book?
There are 12 chapters, starting with the core, unique concepts of Polars. You’ll learn basic aggregations, window functions, how to work with missing values, how to visualize data, how to manipulate strings and nested structures, how to work with time series data, and more. Later in the book, it even covers how to work with common cloud data sources such as Amazon S3, Azure Data Lake, Google Cloud Storage, BigQuery, and Snowflake. Another great topic covered in the book is testing and debugging with Python Polars. It’ll also cover how you might optimize your queries and apply unit tests in your code.
Here are the topics covered in each chapter of the book:
- Chapter 1: Getting Started with Python Polars
- Introducing Key Features in Polars
- The Polars DataFrame
- The Polars Series
- The Polars LazyFrame
- Selecting columns and filtering data
- Creating, modifying, and deleting columns
- Understanding method chaining
- Processing datasets larger than RAM
- Chapter 2: Reading and Writing Files
- Reading and writing CSV files
- Reading and writing parquet files
- Reading and writing delta tables
- Reading and writing JSON files
- Reading and writing Excel files
- Reading and writing other data file formats
- Reading and writing multiple files
- Working with databases
- Chapter 3: An Introduction to Data Analysis in Python Polars
- Inspecting the DataFrame
- Casting data types
- Handling duplicate values
- Masking sensitive data
- Visualizing data using Plotly
- Detecting and handling outliers
- Chapter 4: Data Transformation Techniques
- Exploring basic aggregations
- Using group by aggregations
- Aggregating values across multiple columns
- Computing with window functions
- Applying UDFs
- Using SQL for data transformations
- Chapter 5: Handling Missing Data
- Identifying missing data
- Deleting rows and columns containing missing data
- Filling missing data
- Chapter 6: Performing String Manipulations
- Filtering strings
- Converting strings into a Date/Datetime/Time
- Extracting substrings
- Cleaning strings
- Splitting strings into lists and structs
- Concatenating and combining strings
- Chapter 7: Working with Nested Data Structures
- Creating lists
- Aggregating elements in lists
- Accessing and selecting elements in lists
- Applying logic to each element in lists
- Working with structs and JSON data
- Chapter 8: Reshaping and Tidying Data
- Turning columns into rows
- Turning rows into columns
- Joining DataFrames
- Concatenating DataFrames
- Other reshaping techniques
- Chapter 9: Time Series Analysis
- Working with date and time
- Applying rolling windows calculations
- Resampling techniques
- Time series forecasting with the functime library
- Chapter 10: Interoperability With Other Python Libraries
- Converting to and from a pandas DataFrame
- Converting to and from NumPy arrays
- Interoperating with PyArrow
- Integration with DuckDB
- Chapter 11: Working With Common Cloud Sources
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- BigQuery
- Snowflake
- Chapter 12: Testing and Debugging in Polars
- Debugging chained operations
- Inspecting and optimizing the query plan
- Testing data quality with cuallee
- Running unit tests with Pytest
Who Should Read Polars Cookbook?
The target audiences are data scientists, data analysts, and data engineers who use Python to build data processing pipelines. Maybe you’re a pandas power user who analyzes data and builds machine learning models. Maybe you use PySpark to build ETL logic.
Another target audience is those who know Python, but don’t have experience with a dataframe library. Polars Cookbook gives you a good, general introduction to a dataframe library in Chapter 1, covering the basic concepts around dataframes as well as Polars’ unique features.
To put it simply, whether you’re a seasoned pandas/PySpark user or someone who wants to learn a dataframe library for the first time, this book will get you up to speed in using Python Polars.
Conclusion
One of the challenges Polars is facing is the lack of learning resources. If you google how to do something in pandas, hundreds of articles and websites will pop up in your search results. ChatGPT and other LLMs are great at answering questions about pandas code, too, but that’s because there are already tons of resources out there. This is something that Polars lacks at the moment. Yes, it has a great user guide and API documentation, but that’s usually not enough to learn the ins and outs of how to leverage a tool to its full extent.
And Polars Cookbook has been written to fill that gap. I hope that the book will be useful in your journey of learning Python Polars. Whether you read it from start to finish, or have it sit on your bookshelf and use it as a reference when needed, I’d be happy to hear that you use it to learn Python Polars.
Feel free to reach out to me for any questions or feedback regarding the book at yuki@oremdata.com or on LinkedIn.
Links: