Menu
Stuff by Yuki
  • Home
  • Data Engineering
    • Python
  • Business Intelligence
    • Power BI
    • Tableau
  • Perspectives
  • About
  • Contact
Stuff by Yuki

How to Inspect and Optimize Query Plans in Python Polars

Posted on September 4, 2024September 4, 2024
Image by Joanna on Unsplash

Introduction

One advantage Polars brings is its ability to evaluate queries lazily (lazy mode) instead of executing code immediately (eager mode). This allows the Polars engine to apply optimizations to your query. Polars uses dataframes for eager evaluation and lazyframes for lazy evaluation.

In this post, we’ll cover how to inspect and optimize query plans in Python Polars. You can find the code in this GitHub repo.

Inspect Query Plans

There are two methods you can use to inspect query plans.

  • explain
  • show_graph

The .explain() method shows the query plan in a text format. On the other hand, the .show_graph() method visualizes the query plan. You’ll need to install the GraphViz package and add it to your PATH for using the .show_graph() method.

Here are the code examples for using those methods.

The .explain() method

Copy Copied Use a different Browser

import polars as pl

print(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .explain()
)
'''
FILTER [(col("HP")) > (300)] FROM
   WITH_COLUMNS:
   [col("HP").max().alias("Max HP")] 
    Csv SCAN [pokemon.csv]
    PROJECT 3/13 COLUMNS
    SELECTION: [(col("Type 1")) == (String(Grass))]
'''

It returns the optimized query plan by default. If you want to show a non-optimized query plan, you’ll need to set the optimized parameter to False:

Copy Copied Use a different Browser

print(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .explain(optimized=False)
)
'''
SELECT [col("Name"), col("Type 1"), col("HP"), col("Max HP")] FROM
  FILTER [(col("HP")) > (300)] FROM
     WITH_COLUMNS:
     [col("HP").max().alias("Max HP")] 
      FILTER [(col("Type 1")) == (String(Grass))] FROM
        Csv SCAN [pokemon.csv]
        PROJECT */13 COLUMNS
'''

If you look at the line that says “Csv SCAN [pokemon.csv]” and below, this non-optimized query plan doesn’t include filtering and column selection at the scan level. Whereas, if you look at the optimized query plan above, there are only 3 columns selected and data is filtered where the “Type 1” column equals “Grass” at the scan level.

As you can see, using a lazyframe benefits from optimizations that are not available in dataframes.

The .show_graph() method

Let’s run the same queries using the .show_graph() method.

Copy Copied Use a different Browser

(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .show_graph()
)

The output looks like the following:

We can see the same kinds of optimizations in the graph as we saw with the .explain() method.

Now let’s run the non-optimized query:

Copy Copied Use a different Browser

(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .show_graph(optimized=False)
)

The output looks like the following:

As you can see, filtering and column selection are not happening at the scan level.

Optimize Query Plans

If you look at the code carefully, you notice that even in the optimized query plan, there is one filter that’s not pushed to the scan level. Which is a filter on the “HP” column where we only keep rows whose HR is greater than 300.

Polars tries its best to find inefficiencies in your query, however, sometimes you may find cases like this. That’s why the .explain() and .show_graph() methods are very useful in finding things that could be optimized.

Now, what I’m going to show is a very simple thing, which is to move up this filtering logic earlier in the query:

Copy Copied Use a different Browser

(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .filter(pl.col('HP')>300)
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .show_graph()
)

And this is the output:

Boom! Now we see the filtering logic pushed to the scan level. We successfully optimized the query!

Check the Execution Time for Each Operation in your Query

Aside from the two methods introduced above, there is the .profile() method that shows a table containing the information on how long it took for Polars to execute each operation. This is useful when trying to identify the exact operation that might needs some tweaking.

Copy Copied Use a different Browser

(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .filter(pl.col('HP')>300)
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .profile()
)
'''
(shape: (0, 4)
 ┌──────┬────────┬─────┬────────┐
 │ Name ┆ Type 1 ┆ HP  ┆ Max HP │
 │ ---  ┆ ---    ┆ --- ┆ ---    │
 │ str  ┆ str    ┆ i64 ┆ i64    │
 ╞══════╪════════╪═════╪════════╡
 └──────┴────────┴─────┴────────┘,
 shape: (3, 3)
 ┌─────────────────────────────┬───────┬──────┐
 │ node                        ┆ start ┆ end  │
 │ ---                         ┆ ---   ┆ ---  │
 │ str                         ┆ u64   ┆ u64  │
 ╞═════════════════════════════╪═══════╪══════╡
 │ optimization                ┆ 0     ┆ 51   │
 │ csv(pokemon.csv, predicate) ┆ 51    ┆ 1825 │
 │ with_column(Max HP)         ┆ 1864  ┆ 2041 │
 └─────────────────────────────┴───────┴──────┘)
'''

The .profile() method returns a tuple containing a materialized dataframe and a table with profiling information. Note that in our specific example, there are no rows returned because the filters in the query filtered out all the rows.

Summary

Although Polars is already fast and efficient, it’s good to know how to inspect and optimize your queries in Polars!

P.S. I include tips like this in Polars Cookbook. If you haven’t, make sure to check it out on GitHub and Amazon!

References

  • https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.explain.html
  • https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.show_graph.html
  • https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.profile.html

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • Where I’m Headed in the Next 5 Years
  • Open-Source vs Vendor Data Tools
  • Developing the Habit of Writing
  • How to Inspect and Optimize Query Plans in Python Polars
  • Learn Python Polars with Polars Cookbook

Popular Posts

  • A Running Total Calculation with Quick Measure in Power BI
  • A Complete Guide to Git Integration in Power BI
  • How To Copy And Paste Report Page in Power BI
  • Handling Missing Values in Polars
  • How to Convert String to Date or Datetime in Polars

connect with me

  • LinkedIn
  • Twitter
  • Github
  • Website

Search Articles

©2025 Stuff by Yuki | Powered by SuperbThemes