Polars for initial data analysis, Polars for production

Initial data analysis (IDA) has different goals than your final, production data analysis:

  • With IDA you need to examine the initial data and intermediate results, check your assumptions, and try different approaches. Exploratory data analysis has similar requirements.
  • Once you’re happy with your approach, and you’re ready to run the analysis in an automated manner, you care a lot more about speed and resource usage.

These different goals often benefit from different implementation strategies and tools—unless you have a sufficiently flexible tool like Polars, the super-fast dataframe library. In particular, Polars has two fundamental APIs, each of which is useful in different situations:

  • “Eager” mode, which is similar to how Pandas works, is well-suited for initial and exploratory data analysis.
  • For production use, “lazy” mode often executes much faster, with lower memory usage, at the cost of not giving you access to intermediate results.

In this article we’ll use both APIs, and see how Polars lets you transition from interactively exploring the data to an analysis you can run even more efficiently in production.

An example: Generating a report of annual bikeshare usage

The Boston-area Bluebikes bikeshare program is increasingly popular; how many rides happen annually in just one local city, Cambridge? We have data!

To understand how to correctly generate a report from this data we will need to do some work.

Initial data analysis

Step 1. Getting the data

I’m not going to go into too much detail, except to say that:

  1. The ZIP files have inconsistent names, because the system used to be called Hubway and now it’s called Bluebikes.
  2. Likewise, the CSVs are inconsistently named.

Eventually I wrote a script:

#!/bin/bash
set -euo pipefail
wget https://s3.amazonaws.com/hubway-data/20{16,17}{01,02,03,04,05,06,07,08,09,10,11,12}-hubway-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/2018{01,02,03,04}-hubway-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/2018{05,06,07,08,09,10,11,12}-bluebikes-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/20{19,20,21,22}{01,02,03,04,05,06,07,08,09,10,11,12}-bluebikes-tripdata.zip
wget https://s3.amazonaws.com/hubway-data/2023{01,02}-bluebikes-tripdata.zip
find . -iname "*-hubway-*.zip" -exec unzip {} \;
find . -iname "*-bluebikes-*.zip" -exec unzip {} \;
rm -f *-hubway-*.zip
rm -f *-bluebikes-*.zip
wget https://s3.amazonaws.com/hubway-data/current_bluebikes_stations.csv

At the time of writing, data was only available for January and February 2023.

Interlude: Eager APIs vs. Lazy APIs

There are two styles of API in Polars for processing data:

  1. Eager: You tell the API to do something, and it does it immediately. This is how Pandas works.
  2. Lazy: You stack a series of operations (loading the data, filtering the data, summarizing the data). No work is done at this point. Once all the operations are queued up, you tell the library to execute them all in one go. This is how Dask works.

Lazy execution gives the implementation plenty of opportunity for automatic optimization. For example, if you’re not going to use the "stoptime" column in your analysis, there’s no point in even loading it. In contrast, an eager API has no idea how you will use the data next, so it just has to do what you told it, with no shortcuts.

For IDA, we’re going to be spending time looking at intermediate steps in the processing, so a lazy API won’t work. On the other hand, a lazy API is what you want in production usage, since it will be able to run faster.

Unlike Pandas, the Polars DataFrame library provides both kinds of APIs. Since we’re doing IDA first, we’ll begin with the eager API.
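
To make the difference concrete, here’s a minimal sketch; the file name is just a placeholder, not part of our actual dataset:

import polars as pl

# Eager: the whole CSV is read and filtered immediately, at each call.
df = pl.read_csv("rides.csv")
short_rides = df.filter(pl.col("tripduration") < 600)

# Lazy: this only *describes* the work; Polars can then skip unused columns,
# push the filter down into the CSV reader, and so on.
lazy_query = pl.scan_csv("rides.csv").filter(pl.col("tripduration") < 600)
short_rides = lazy_query.collect()  # all the work happens here, in one go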

Step 2. Figuring out Cambridge’s stations

I’m going to recreate, in the IPython command-line, the initial analysis I did in Jupyter.

First, we import Polars:

In [1]: import polars as pl

And we load the stations:

In [2]: stations = pl.read_csv("data/current_bluebikes_stations.csv", skip_rows=1, columns=["Number", "Name", "District", "Public"])

In [3]: stations
Out[3]: 
shape: (448, 4)
------------------------------------------------
 Number  Name           District    Public 
 ---     ---            ---         ---    
 str     str            str         str    
------------------------------------------------
 K32015  1200 Beacon St Brookline   Yes    
 W32006  160 Arsenal    Watertown   Yes    
 A32019  175 N Harvard  Boston      Yes    
 S32035  191 Beacon St  Somerville  Yes    
                                       
 A32043  Western Ave at Boston      Yes    
         St                                
 B32059  Whittier St He Boston      Yes    
 D32040  Williams St at Boston      Yes    
         St                                
 S32005  Wilson Square  Somerville  Yes    
------------------------------------------------

And now we choose only the Cambridge stations:

In [4]: cambridge_stations = stations.filter(pl.col("District") == "Cambridge")

In [5]: cambridge_stations
Out[5]: 
shape: (78, 4)
-------------------------------------------------------
 Number  Name                   District   Public 
 ---     ---                    ---        ---    
 str     str                    str        str    
-------------------------------------------------------
 M32026  359 Broadway - Broadwa Cambridge  Yes    
         Fayet                                   
 M32054  699 Mt Auburn St       Cambridge  Yes    
 M32060  700 Huron Ave          Cambridge  Yes    
 M32064  75 Binney St           Cambridge  Yes    
                                              
 M32048  Third at Binney        Cambridge  Yes    
 M32040  University Park        Cambridge  Yes    
 M32057  Vassal Lane at Tobin/V Cambridge  Yes    
 M32050  Verizon Innovation Hub Cambridge  Yes    
         Ware S                                  
-------------------------------------------------------

According to this, there are 78 stations in Cambridge. So let’s validate that by checking the relevant page on the city website. It says: “The Bluebikes system has nearly 500 bike share stations, with over 80 located in Cambridge (as of April 2023).”

That doesn’t match the number we found. It’s possible some stations were installed after this CSV was created, but that seems unlikely given construction season has only just started. Or maybe the city website is wrong.

Step 3. Loading the rides

Let’s load all the rides. We’ll need them anyway, and that will give us a way to cross-check stations, by seeing if there are stations in the list of rides that aren’t in the list of stations.

In [6]: rides = pl.read_csv("data/20*.csv")
...
ComputeError: Could not parse `\N` as dtype `i64` at column 'birth year' (column number 14).
The current offset in the file is 760019 bytes.

That’s annoying: one of the columns has corrupt data, or perhaps it uses that symbol to indicate unknown data. This is apparently the birth year of the rider on each ride, which we don’t care about right now.
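
(As an aside, if we did care about that column, read_csv()’s null_values parameter lets us tell Polars to treat a marker like \N as missing data. A sketch of that is below; as we’re about to see, though, it wouldn’t have been enough on its own.)

# Sketch only: treat the literal string \N as null instead of failing to
# parse it as an integer. We don't actually need the birth year column here.
rides = pl.read_csv("data/20*.csv", null_values=[r"\N"])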

Let’s look at the columns in the CSVs and try to pick only the ones we do care about:

$ head -1 data/202212-bluebikes-tripdata.csv
"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","postal code"

Notice what’s missing? "birth year" isn’t in the list! Apparently only some of the CSVs have that column, which means we have inconsistent columns across CSVs. That’s no fun.

Luckily, picking only the columns we want allows us to load all the CSVs into a single DataFrame:

In [7]: rides = pl.read_csv("data/20*.csv", columns=["starttime", "start station id", "end station id", "start station name", "end station name"])

This takes a few seconds, and now we have all the rides loaded.
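
It’s worth a quick sanity check on what we just loaded; estimated_size() reports a DataFrame’s approximate in-memory size (the exact numbers will depend on your copy of the data):

# How many rows did we get, and roughly how much RAM are they using?
print(rides.shape)
print(rides.estimated_size("gb"))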

Step 4. Cross-referencing station names

We have two sets of stations: a CSV with station names and locations, and the start and end station for each ride. Let’s make sure they match, and see if we can find the mysterious missing Cambridge stations.

This sounds like a job for a set!

In [8]: ride_stations = set()
In [9]: for column in ["start station name", "end station name"]:
    ...:     ride_stations |= set(rides.get_column(column).to_list())

In [10]: len(ride_stations)
Out[10]: 690

Looking back we see that the stations CSV had 448 rows, and 690 != 448. Let’s inspect the data and see what’s going on, with commentary interleaved:

In [12]: ride_stations
Out[12]: 
{'1200 Beacon St',
 '160 Arsenal',
 '175 N Harvard St',
 '18 Dorrance Warehouse',
 '191 Beacon St',
 '2 Hummingbird Lane at Olmsted Green',
 # (WHAT, oh there's an extra period)
 '30 Dane St',
 '30 Dane St.',
 # (Oh, there's former stations although this one seems strange...)
 '30 Dane St. (former)',
 '359 Broadway - Broadway at Fayette Street',
 # (And temporary winter locations)
 '515 Somerville Ave (Temp. Winter Location)',
 ...
 # (Sometimes with different spelling!)
 'Charles Circle - Charles St at Cambridge St TEMPORARY WINTER LOCATION',
 ...

Well, that’s fun: former stations, inconsistent spelling, winter relocations… We can also check which stations appear in the rides but not in the stations CSV, to see if there are any other categories of inconsistency we’re missing.
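
Here’s a quick sketch of that check, reusing the ride_stations set we just built:

# Station names that appear in the rides but not in the stations CSV.
known_stations = set(stations.get_column("Name").to_list())
unknown_stations = ride_stations - known_stations
print(len(unknown_stations))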

Step 5. A different approach to finding stations in Cambridge

But!

Let’s rethink first: we’re trying to find rides going to or from Cambridge. One way to do that is to match on the “District” each station is in, information that’s only in the stations CSV. But is there another way?

Let’s look again at the columns we have available for rides:

"tripduration","starttime","stoptime","start station id","start station name","start station latitude","start station longitude","end station id","end station name","end station latitude","end station longitude","bikeid","usertype","postal code"

Longitude and latitude for each station would give us enough information to find the city, if we ran them through a geocoder. Postal code would be even easier, since I’m fairly certain Cambridge ZIP codes don’t cross city boundaries. Except there’s only one postal code column per ride, and we want one for both the start station and the end station.

Lovely.
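
For completeness, if we did go down the geocoding path, a sketch might look like the following. This is purely illustrative: it assumes the geopy package and OpenStreetMap’s Nominatim service (which is rate-limited, hence the RateLimiter), and we would reverse-geocode each unique station coordinate rather than every ride.

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="bluebikes-analysis")
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)

def city_for(latitude, longitude):
    """Best-effort city name for a station's coordinates."""
    address = reverse((latitude, longitude)).raw.get("address", {})
    # Nominatim reports "city", "town", or "village" depending on the place:
    return address.get("city") or address.get("town") or address.get("village")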

Step 6. Fuck it, let’s just generate a report

Probably we want to do geocoding. But the point here is to talk about Polars, so maybe we’re digressing too much.

So let’s just pretend that the stations CSV is complete and accurate, and find all the rides either starting or finishing in Cambridge, using the cambridge_stations we calculated earlier:

In [13]: cambridge_stations = cambridge_stations.get_column("Name")

In [14]: cambridge_rides = rides.filter(pl.col("start station name").is_in(cambridge_stations) | pl.col("end station name").is_in(cambridge_stations))

In [15]: cambridge_rides
Out[15]: 
shape: (7139138, 5)
---------------------------------------------------------------------------------------------------------------------------------
 starttime                 start station id  end station id  start station name                 end station name
 ---                       ---               ---             ---                                ---
 str                       i64               i64             str                                str
---------------------------------------------------------------------------------------------------------------------------------
 2016-01-01 00:15:36       36                67              Boston Public Library - 700 Boyl…  MIT at Mass Ave / Amherst St
 2016-01-01 00:25:13       107               176             Ames St at Main St                 Lesley University
 2016-01-01 00:25:24       141               90              Kendall Street                     Lechmere Station at Cambridge St
 2016-01-01 00:45:02       178               80              MIT Pacific St at Purrington St    MIT Stata Center at Vassar St /…
 …                         …                 …               …                                  …
 2023-02-28 23:42:36.4470  97                67              Harvard University River Houses    MIT at Mass Ave / Amherst St
 2023-02-28 23:43:10.7420  334               145             Mass Ave at Hadley/Walden          Rindge Avenue - O'Neill Library
 2023-02-28 23:54:01.4120  437               68              Berkshire Street at Cambridge St…  Central Square at Mass Ave / Ess…
 2023-02-28 23:58:20.9490  553               413             Cambridge Crossing at North Firs…  Kennedy-Longfellow School 158 Sp…
---------------------------------------------------------------------------------------------------------------------------------

More than 7 million rides, not bad! And this is almost certainly an undercount, given we know we’re probably missing some stations.

Now that we have the rides, we can group them by year:

In [16]: def by_year(df):
    ...:     return df.groupby_dynamic("starttime", every="1y").agg(pl.count())
    ...: 

In [17]: by_year(cambridge_rides)
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[17], line 1
----> 1 by_year(cambridge_rides)

...

ComputeError: expected any of the following dtypes: { Date, Datetime, Int32, Int64 }, got str

Oops. The starttime column is still a string; we forgot to parse it into a date. Let’s do that, keeping just the date portion, since the timestamp format isn’t consistent across years (the 2023 rows include fractional seconds, the 2016 rows don’t).

In [18]: cambridge_rides = cambridge_rides.with_columns(pl.col("starttime").str.slice(0, 10).str.strptime(pl.Date, fmt="%Y-%m-%d", strict=False))

And now we can look at the by-year counts. I’ll leave making a graph as an exercise for the reader, since the table is sufficient to show that, with a brief speedbump from the pandemic, ridership keeps growing:

In [20]: by_year(cambridge_rides)
Out[20]: 
shape: (8, 2)
------------------------
 starttime   count   
 ---         ---     
 date        u32     
------------------------
 2016-01-01  556273  
 2017-01-01  616359  
 2018-01-01  837028  
 2019-01-01  1152015 
 2020-01-01  804192  
 2021-01-01  1296830 
 2022-01-01  1720675 
 2023-01-01  155766  
------------------------

One thing to notice is that 2023 is much lower, because it only includes two months of data; including it would be misleading unless we somehow marked it visually as incomplete. So for now we’ll just omit 2023:

In [21]: by_year(cambridge_rides.filter(pl.col("starttime").dt.year() < 2023))
Out[21]: 
shape: (7, 2)
------------------------
 starttime   count   
 ---         ---     
 date        u32     
------------------------
 2016-01-01  556273  
 2017-01-01  616359  
 2018-01-01  837028  
 2019-01-01  1152015 
 2020-01-01  804192  
 2021-01-01  1296830 
 2022-01-01  1720675 
------------------------
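
As an aside, groupby_dynamic() isn’t the only way to bucket rides by year. A roughly equivalent sketch (assuming the same Polars version as above) derives an explicit year column and groups on that:

def by_year_alternative(df):
    # Derive an integer year column, then do an ordinary groupby on it.
    return (
        df.with_columns(pl.col("starttime").dt.year().alias("year"))
        .groupby("year")
        .agg(pl.count())
        .sort("year")
    )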

Interlude: What’s involved in initial data analysis?

As we’ve seen, during the initial data analysis, even for a seemingly trivial report, we had to deal with:

  • Inconsistent input files (names and columns).
  • Bad data (whatever the "\N" was in the "birth year" column).
  • Inconsistent information across input types (station CSV has different stations than rides CSVs).
  • Questions about how to get the most accurate data (maybe we should abandon the station list and just use the longitude/latitude of stations + geocoding?).
  • Post-processing data (converting strings to dates).
  • Thinking about different ways to represent data (should we include 2023?).

And more! I omitted some random typos, API usage mistakes, and other problems, just for clarity.

Because we didn’t know about these problems up front, an eager API where we immediately get back results was immensely useful in debugging problems, investigating alternatives, and working through issues.

Switching to a production script

Now that we have an outline of a working analysis, let’s convert it into a standalone script based on what we’ve learned.

An eager production script

As a first pass, we’ll continue using the eager API, basically the same code as above converted into a script:

import polars as pl

stations = pl.read_csv(
    "data/current_bluebikes_stations.csv",
    skip_rows=1,
    columns=["Number", "Name", "District", "Public"],
)
cambridge_stations = stations.filter(
    pl.col("District") == "Cambridge"
).get_column("Name")
rides = (
    pl.read_csv(
        "data/20*.csv",
        columns=[
            "starttime",
            "start station id",
            "end station id",
            "start station name",
            "end station name",
        ],
    )
    .with_columns(
        pl.col("starttime")
        .str.slice(0, 10)
        .str.strptime(pl.Date, fmt="%Y-%m-%d", strict=False)
    )
)
cambridge_rides = rides.filter(
    pl.col("start station name").is_in(cambridge_stations)
    | pl.col("end station name").is_in(cambridge_stations)
)


def by_year(df):
    return df.groupby_dynamic("starttime", every="1y").agg(
        pl.count()
    )


print(
    str(
        by_year(
            cambridge_rides.filter(
                pl.col("starttime").dt.year() < 2023
            )
        )
    )
)

If we run this, it takes:

  • 3.2 wallclock seconds.
  • 12.9 CPU seconds.
  • 4.5 gigabytes of RAM.
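
(In case you want to measure something similar yourself, here’s one rough way to do it; this isn’t necessarily how the numbers above were produced.)

# Rough measurement harness around the script's work. Note that ru_maxrss is
# reported in kilobytes on Linux and bytes on macOS.
import resource
import time

start_wall = time.monotonic()
start_cpu = time.process_time()

# ... run the analysis from the script above ...

print(f"Wallclock seconds: {time.monotonic() - start_wall:.1f}")
print(f"CPU seconds: {time.process_time() - start_cpu:.1f}")
print(f"Peak RSS: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss}")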

Can we do better?

Note: Whether or not any particular tool or technique will speed things up depends on where the bottlenecks are in your software.

Need to identify the performance and memory bottlenecks in your own Python data processing code? Try the Sciagraph profiler, with support for profiling both in development and production on macOS and Linux, and with built-in Jupyter support.

A performance timeline created by Sciagraph, showing both CPU and I/O as bottlenecks
A memory profile created by Sciagraph, showing a list comprehension is responsible for most memory usage

Switching to a lazy API

In the above example, each step executed immediately, sometimes doing unnecessary work. This is how Pandas works as well. But Polars also includes a lazy API that only executes the work once we’ve set up the whole query, which gives it the chance to optimize the query as a whole.

For the stations CSV, we don’t bother switching to the lazy API, because it’s so small (a few hundred rows) that it won’t make a difference. But for the millions of rows in the rides CSVs, we hope it will speed things up.

Switching to the lazy API involves swapping polars.read_csv() for polars.scan_csv(), and then calling collect() at the end of the query to materialize the final result.

import polars as pl

stations = pl.read_csv(
    "data/current_bluebikes_stations.csv",
    skip_rows=1,
    columns=["Number", "Name", "District", "Public"],
)
cambridge_stations = stations.filter(
    pl.col("District") == "Cambridge"
).get_column("Name")

# Switched to lazy API:
rides = (
    pl.scan_csv("data/20*.csv") # 💖 replaced read_csv() 💖
    .with_columns(
        pl.col("starttime")
        .str.slice(0, 10)
        .str.strptime(pl.Date, fmt="%Y-%m-%d", strict=False)
    )
)
cambridge_rides = rides.filter(
    pl.col("start station name").is_in(cambridge_stations)
    | pl.col("end station name").is_in(cambridge_stations)
)


def by_year(df):
    return df.groupby_dynamic("starttime", every="1y").agg(
        pl.count()
    )


print(
    str(
        by_year(cambridge_rides)
        .filter(pl.col("starttime").dt.year() < 2023)
        .collect()  # 💖 Added .collect() 💖
    )
)

The change is tiny: two lines of code. We also no longer bother specifying which columns to load, since Polars can figure that out on its own, based on our query.

Here’s a comparison of both versions:

Mode   Wallclock time (sec)  CPU time (sec)  Peak memory (GB)
Eager  3.2                   12.9            4.5
Lazy   2.3                   10.2            1.6

We have reduced CPU usage, improved parallelism so the results come back even faster, and massively reduced memory usage. Not bad!

The downside is that we don’t have access to intermediate results: the query does exactly what we asked for, and only once we call collect(). But for production use we don’t care; we’ve already figured out the final query we want to run.
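
That said, you can still ask Polars what it’s planning to do. Depending on your Polars version, something like describe_optimized_plan() (or explain() in newer releases) prints the optimized query plan without executing it:

# Print the optimized plan for the lazy query, without running it.
query = by_year(cambridge_rides).filter(pl.col("starttime").dt.year() < 2023)
print(query.describe_optimized_plan())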

What about Pandas?

Pandas only offers the equivalent of Polars’ eager mode, and mostly doesn’t have parallelism except maybe when loading CSVs. So while I haven’t implemented the above query in Pandas, I would expect it to be no better than the Polars eager mode, and quite likely much worse.

Comparing eager and lazy timelines

We can use the Sciagraph profiler to compare how these two modes work.

In addition to the usual flamegraphs it generates for both speed and memory usage, Sciagraph also generates timelines, which make it easier to see parallelism and the order of operations. These are best viewed in a large window on a desktop computer; they will be hard to read on a phone.

Eager version timeline

Notice we have a series of distinct operations, each triggered by different Python code:


Lazy version timeline

In the lazy version there is only a single Python operation, the collect(), which triggers lots of work internally in Polars. We can also see that parallelism happens more often in the Polars thread pool.


Give Polars a try!

The ability to do both exploratory data analysis and final production code using the same library is very compelling. What’s more, Polars is fast, and it lets you take advantage of multiple cores. And once you’ve nailed down your queries, switch to lazy mode and everything will run even faster.