How to Inspect and Optimize Query Plans in Python Polars

Introduction

One advantage Polars brings is its ability to evaluate queries lazily (lazy mode) instead of executing code immediately (eager mode). This allows the Polars engine to apply optimizations to your query. Polars uses dataframes for eager evaluation and lazyframes for lazy evaluation.

In this post, we’ll cover how to inspect and optimize query plans in Python Polars. You can find the code in this GitHub repo.

Inspect Query Plans

There are two methods you can use to inspect query plans.

explain
show_graph

The .explain() method shows the query plan in a text format. On the other hand, the .show_graph() method visualizes the query plan. You’ll need to install the GraphViz package and add it to your PATH for using the .show_graph() method.

Here are the code examples for using those methods.

The .explain() method

Copy


import polars as pl

print(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .explain()
)
'''
FILTER [(col("HP")) > (300)] FROM
   WITH_COLUMNS:
   [col("HP").max().alias("Max HP")] 
    Csv SCAN [pokemon.csv]
    PROJECT 3/13 COLUMNS
    SELECTION: [(col("Type 1")) == (String(Grass))]
'''

It returns the optimized query plan by default. If you want to show a non-optimized query plan, you’ll need to set the optimized parameter to False:

Copy


print(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .explain(optimized=False)
)
'''
SELECT [col("Name"), col("Type 1"), col("HP"), col("Max HP")] FROM
  FILTER [(col("HP")) > (300)] FROM
     WITH_COLUMNS:
     [col("HP").max().alias("Max HP")] 
      FILTER [(col("Type 1")) == (String(Grass))] FROM
        Csv SCAN [pokemon.csv]
        PROJECT */13 COLUMNS
'''

If you look at the line that says “Csv SCAN [pokemon.csv]” and below, this non-optimized query plan doesn’t include filtering and column selection at the scan level. Whereas, if you look at the optimized query plan above, there are only 3 columns selected and data is filtered where the “Type 1” column equals “Grass” at the scan level.

As you can see, using a lazyframe benefits from optimizations that are not available in dataframes.

The .show_graph() method

Let’s run the same queries using the .show_graph() method.

Copy


(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .show_graph()
)

The output looks like the following:

We can see the same kinds of optimizations in the graph as we saw with the .explain() method.

Now let’s run the non-optimized query:

Copy


(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .filter(pl.col('HP')>300)
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .show_graph(optimized=False)
)

The output looks like the following:

As you can see, filtering and column selection are not happening at the scan level.

Optimize Query Plans

If you look at the code carefully, you notice that even in the optimized query plan, there is one filter that’s not pushed to the scan level. Which is a filter on the “HP” column where we only keep rows whose HR is greater than 300.

Polars tries its best to find inefficiencies in your query, however, sometimes you may find cases like this. That’s why the .explain() and .show_graph() methods are very useful in finding things that could be optimized.

Now, what I’m going to show is a very simple thing, which is to move up this filtering logic earlier in the query:

Copy


(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .filter(pl.col('HP')>300)
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .show_graph()
)

And this is the output:

Boom! Now we see the filtering logic pushed to the scan level. We successfully optimized the query!

Check the Execution Time for Each Operation in your Query

Aside from the two methods introduced above, there is the .profile() method that shows a table containing the information on how long it took for Polars to execute each operation. This is useful when trying to identify the exact operation that might needs some tweaking.

Copy


(
    pl.scan_csv('pokemon.csv')
    .filter(pl.col('Type 1')=='Grass')
    .filter(pl.col('HP')>300)
    .with_columns(pl.col('HP').max().alias('Max HP'))
    .select('Name', 'Type 1', 'HP', 'Max HP')
    .profile()
)
'''
(shape: (0, 4)
 ┌──────┬────────┬─────┬────────┐
 │ Name ┆ Type 1 ┆ HP  ┆ Max HP │
 │ ---  ┆ ---    ┆ --- ┆ ---    │
 │ str  ┆ str    ┆ i64 ┆ i64    │
 ╞══════╪════════╪═════╪════════╡
 └──────┴────────┴─────┴────────┘,
 shape: (3, 3)
 ┌─────────────────────────────┬───────┬──────┐
 │ node                        ┆ start ┆ end  │
 │ ---                         ┆ ---   ┆ ---  │
 │ str                         ┆ u64   ┆ u64  │
 ╞═════════════════════════════╪═══════╪══════╡
 │ optimization                ┆ 0     ┆ 51   │
 │ csv(pokemon.csv, predicate) ┆ 51    ┆ 1825 │
 │ with_column(Max HP)         ┆ 1864  ┆ 2041 │
 └─────────────────────────────┴───────┴──────┘)
'''

The .profile() method returns a tuple containing a materialized dataframe and a table with profiling information. Note that in our specific example, there are no rows returned because the filters in the query filtered out all the rows.

Summary

Although Polars is already fast and efficient, it’s good to know how to inspect and optimize your queries in Polars!

P.S. I include tips like this in Polars Cookbook. If you haven’t, make sure to check it out on GitHub and Amazon!

References

https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.explain.html

https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.show_graph.html

https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.profile.html

Introduction

Inspect Query Plans

The .explain() method

The .show_graph() method

Optimize Query Plans

Check the Execution Time for Each Operation in your Query

Summary

References

Leave a Reply Cancel reply