Using breakpoints to explore your code

MP 89: You can hop into your program's execution at any point.

When working on a project, I often find myself wondering if my code does exactly what I think it does. Sometimes I'll have an idea about how to move forward in a program, but won't be quite sure how to implement the idea. You can run your programs as often as you want to see if they do what they're supposed to, but that's not always the most efficient way to work.

I spent an embarrassingly long part of my life as a programmer sprinkling print() calls into my code, just to see what was happening in my programs. Many IDEs offer better ways to explore the current state of your project, and Python has a built-in way to do this as well.

In this post we'll look at a really straightforward way to explore what's happening in your programs, even when using the simplest editors and environments.

Data spread across multiple files

As an example, let's look at a program that ingests some data and converts it to a Python data structure. When migrating between platforms, I had to combine information from a variety of export files into a single file that I could upload to the new platform.

Here's an example CSV file with some information about a few users on the original platform [1]:

email,name
mitchell_lewis@example.org,Mitchell Lewis
nathan_smith@example.com,Nathan Smith
megan_gardner@example.gov,Megan Gardner
subscribers.csv

That file has only two kinds of information: each person's email address and name.

But there's another critical piece of information, stored in a separate export file:

email,subscriber_id
mitchell_lewis@example.org,5eb63b
nathan_smith@example.com,d4ab93
megan_gardner@example.gov,1f9996
subscriber_ids.csv

Each user's ID was stored in a separate file. I needed to combine the information in these two files into a single file I could use for importing into the new platform.

Ingesting and transforming data

I'm sure there are well-established algorithms for efficiently combining data from multiple CSV files. But I only needed to do this once, so I didn't want to research best practices in this area. I just wanted to write a bit of Python code that would let me generate a valid import file for the new platform. I also wanted to be able to work with this data in Python, because I might want to use the data in a variety of ways on the new platform.

With this kind of work, I often just start with what I know. What do I know that's helpful here? I like working with dataclasses. Let's set up a small dataclass containing the information I want to use:

from dataclasses import dataclass

@dataclass
class Member:
    email: str = ""
    member_id: str = ""
    name: str = ""
generate_import_file.py

That's great; once I have a sequence of Member objects, I can do anything I want with that data.
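
As a quick illustration of what a dataclass gives you here, the sketch below builds a couple of Member objects by hand and treats them like ordinary Python values. The sample values are copied from the CSV above; sorting by name and converting with asdict() are just illustrative uses, not steps from the actual program.

```python
from dataclasses import asdict, dataclass


@dataclass
class Member:
    email: str = ""
    member_id: str = ""
    name: str = ""


# Fields left out of the call fall back to their empty-string defaults.
nathan = Member(email="nathan_smith@example.com", name="Nathan Smith")
megan = Member(email="megan_gardner@example.gov", name="Megan Gardner")

# Dataclasses generate __eq__, so instances compare by field values.
same_nathan = Member(email="nathan_smith@example.com", name="Nathan Smith")
print(nathan == same_nathan)

# A list of Members can be sorted, filtered, or converted like any other data.
by_name = sorted([nathan, megan], key=lambda member: member.name)
print([member.name for member in by_name])
print(asdict(megan))
```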

What next? I know I can read in data from a CSV using pandas, in one line:

from dataclasses import dataclass
from pathlib import Path

import pandas as pd

@dataclass
class Member:
    email: str = ""
    member_id: str = ""
    name: str = ""

path = Path("subscribers.csv")
subscriber_data = pd.read_csv(path)
generate_import_file.py

This works, but this is also the point where I start to slow down. My program reads data in from the CSV file, and it's available as a dataframe, but I always get a little rusty about exactly how to work with data in pandas.

Using breakpoint() as a development tool

The Python standard library has a module called pdb, short for Python Debugger. I think a lot of people avoid using pdb for development work, because they think it's just for debugging existing programs. But these tools are fantastic for development work as well.

In a workflow that I used for far too long, I'd start inserting print() calls such as print(subscriber_data) and print(subscriber_data.shape) into the code I was working on. That's fine once in a while, but as your main inspection approach it gets really inefficient: every new question means another edit and another run. With the built-in breakpoint() function, which starts a pdb session by default, you can do as much exploratory work as you need based on the code you've already written, and work out much larger chunks of your code in one development session. Let's see how it works.

First, add a call to breakpoint() at the end of the file. Since Python 3.7, breakpoint() is a built-in function, so the import pdb line shown here is optional:

from dataclasses import dataclass
from pathlib import Path
import pdb

...
path = Path("subscribers.csv")
subscriber_data = pd.read_csv(path)

breakpoint()
generate_import_file.py

Now when you run the file in a terminal, you end up in an interactive debugging session:

$ python generate_import_file.py
--Return--
> generate_import_file.py(16)<module>()->None
-> breakpoint()
(Pdb)

This is a terminal session, where everything in the context of the program at the breakpoint is available. All the libraries that were imported are available, and all the variables and data structures defined in the file to that point are available as well.
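
Under the hood, breakpoint() is a built-in that calls sys.breakpointhook(), and the default hook is what starts pdb. The sketch below swaps in a stand-in hook so it can run without pausing; recording_hook and calls are hypothetical names used only for illustration.

```python
import sys

calls = []


def recording_hook(*args, **kwargs):
    # Stand-in for the debugger: just note that a breakpoint was hit.
    calls.append("hit")


# Replace the default hook (which would start pdb) with the stand-in.
sys.breakpointhook = recording_hook
breakpoint()

# Restore the original hook so later breakpoint() calls open pdb again.
sys.breakpointhook = sys.__breakpointhook__
print(calls)
```

The same mechanism is why setting the PYTHONBREAKPOINT environment variable to 0 turns every breakpoint() call into a no-op, which is handy if one slips into code that runs unattended.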

Developing in the debugger

In the debugger session, I'm going to explore an approach to getting the data from the dataframe into a list of Member objects.

First, I'll look at subscriber_data:

(Pdb) subscriber_data
                        email            name
0  mitchell_lewis@example.org  Mitchell Lewis
1    nathan_smith@example.com    Nathan Smith
2   megan_gardner@example.gov   Megan Gardner

Okay, that's about what I expected. It's a dataframe with an index, an email column and a name column.

I think I remember how to get the emails out of a dataframe:

(Pdb) subscriber_data["email"]
0    mitchell_lewis@example.org
1      nathan_smith@example.com
2     megan_gardner@example.gov
Name: email, dtype: object

That worked, but I can't remember if dataframes support dot notation:

(Pdb) subscriber_data.email
0    mitchell_lewis@example.org
...

Great, that works!
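
Dot notation is convenient, but it only works when the column name is a valid Python identifier that doesn't collide with an existing DataFrame attribute or method. Here's a minimal sketch of that caveat, using a made-up dataframe with a column called count (not part of the export files):

```python
import pandas as pd

# Hypothetical data: "count" is also the name of a DataFrame method.
df = pd.DataFrame({"email": ["a@example.com", "b@example.com"], "count": [3, 5]})

# Attribute access finds the column when nothing else uses that name...
print(list(df.email))

# ...but here it returns the DataFrame.count method, not the column.
print(callable(df.count))

# Bracket notation always refers to the column.
print(list(df["count"]))
```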

Now I want to build a list of emails:

(Pdb) emails = list(subscriber_data.email)
(Pdb) emails
['mitchell_lewis@example.org', 'nathan_smith@example.com',
 'megan_gardner@example.gov']

That worked! Now I think I know enough to leave the debugger session, and write some more code. Entering q in the debugger session quits the session:

(Pdb) q
Traceback (most recent call last):
    ...
    if self.quitting: raise BdbQuit
...

If your breakpoint is at the end of the file, you can also enter c for continue. Execution continues to the end of the file, which we've already reached, so the program exits normally without the BdbQuit traceback.

Note: Try to avoid single-letter variable names in pdb sessions, because they collide with single-letter pdb commands such as c, n, and q. If you do need one, prefix the statement with ! so pdb treats it as Python code.

Here's my code, based on what I learned in the debugger session:

...
path = Path("subscribers.csv")
subscriber_data = pd.read_csv(path)

emails = list(subscriber_data.email)
names = list(subscriber_data.name)

# Build a list of Members.
members = []
for email, name in zip(emails, names):
    member = Member()
    member.email = email
    member.name = name

    members.append(member)

breakpoint()
generate_import_file.py

I pulled the emails and names into two separate lists. I then looped over those lists in parallel, and built a new Member instance for each email-name pair.
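
The same list can be built more compactly by passing the values to Member as keyword arguments. This is just an alternative sketch, with the emails and names typed in by hand so it runs without the CSV file:

```python
from dataclasses import dataclass


@dataclass
class Member:
    email: str = ""
    member_id: str = ""
    name: str = ""


# Stand-ins for list(subscriber_data.email) and list(subscriber_data.name).
emails = ["mitchell_lewis@example.org", "nathan_smith@example.com"]
names = ["Mitchell Lewis", "Nathan Smith"]

# zip() pairs each email with the name at the same position.
members = [Member(email=email, name=name) for email, name in zip(emails, names)]
print(members)
```

If the two lists could ever differ in length, zip(emails, names, strict=True) raises an error instead of silently dropping entries (Python 3.10 and later).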

Now when we run the program, we can see if these Member instances were created correctly:

$ python generate_import_file.py
...
(Pdb) members
[Member(email='mitchell_lewis@example.org',
        member_id='',
        name='Mitchell Lewis'),
 Member(email='nathan_smith@example.com',
        ...
]
(Pdb) len(members)
3

This is correct. Each item in the list members has an email and a name, and the member_id attribute is empty. There are 3 members in the list.

Finishing the program

I won't show the rest of the program here, because the focus of this post is on using pdb and breakpoint() for development work. I finished the program by grabbing the IDs from the second CSV file, and assigning them to the Member instances.
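
For reference, here's a rough sketch of what that last step could look like. It isn't the code from the actual program: the id_by_email name and the dictionary-lookup approach are assumptions, and the CSV text is inlined from subscriber_ids.csv above so the example runs on its own.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Member:
    email: str = ""
    member_id: str = ""
    name: str = ""


# Inlined copy of subscriber_ids.csv (first two rows), to avoid file access.
ids_csv = """email,subscriber_id
mitchell_lewis@example.org,5eb63b
nathan_smith@example.com,d4ab93
"""

members = [
    Member(email="mitchell_lewis@example.org", name="Mitchell Lewis"),
    Member(email="nathan_smith@example.com", name="Nathan Smith"),
]

# Map each email to its ID so lookups don't depend on row order.
reader = csv.DictReader(io.StringIO(ids_csv))
id_by_email = {row["email"]: row["subscriber_id"] for row in reader}

for member in members:
    # Leave member_id empty if an email has no matching ID.
    member.member_id = id_by_email.get(member.email, "")

print(members)
```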

There is a more efficient way to do this work by manipulating the dataframes directly. If I need to do this kind of work again I'll learn that approach, and I'll probably do so by keeping a breakpoint at the end of the file in order to explore the objects that are being created. I really like how much you can learn about a data structure by experimenting with it in an interactive session.
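
One way that dataframe-based approach might look is a merge on the shared email column. This is a sketch under assumptions, not the method used for the real migration: the two CSV files are inlined as strings, and id_data and merged are names made up for the example.

```python
import io
from dataclasses import dataclass

import pandas as pd


@dataclass
class Member:
    email: str = ""
    member_id: str = ""
    name: str = ""


# Inlined copies of the two export files, so the sketch needs no files on disk.
subscribers_csv = """email,name
mitchell_lewis@example.org,Mitchell Lewis
nathan_smith@example.com,Nathan Smith
megan_gardner@example.gov,Megan Gardner
"""
ids_csv = """email,subscriber_id
mitchell_lewis@example.org,5eb63b
nathan_smith@example.com,d4ab93
megan_gardner@example.gov,1f9996
"""

subscriber_data = pd.read_csv(io.StringIO(subscribers_csv))
id_data = pd.read_csv(io.StringIO(ids_csv))

# Join the two dataframes on the column they share.
merged = subscriber_data.merge(id_data, on="email", how="left")

# Turn each merged row into a Member.
members = [
    Member(email=row["email"], member_id=row["subscriber_id"], name=row["name"])
    for row in merged.to_dict(orient="records")
]
print(members)
```

If some IDs could look numeric (for example, all digits), passing dtype=str to read_csv() keeps them as text rather than letting pandas convert them to numbers.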

Conclusions

If you've been using print() for development work, try adding a call to breakpoint() in your program the next time you're not exactly sure what's happening. You might be amazed by how much more efficiently you can find a working approach, and how much insight you gain into what's happening in your program.

Some editors and IDEs have integrated debuggers. If the editor you use has that feature, it's almost certainly worthwhile to spend some time learning how to use it. But it's also nice to know that you can drop a breakpoint into just about any Python file on any system, and quickly hop into an interactive session with all of your project's context at your fingertips.

If you're interested in learning more about using pdb, Nina Zakharenko gave an excellent talk at PyCon a few years ago called Goodbye Print, Hello Debugger! I highly recommend it.

Resources

You can find the code files from this post in the mostly_python GitHub repository.


[1] This data was generated using Faker, a great library for generating sample data.