Backporting a Django ORM Feature with Database Instrumentation
Last week I covered Django’s database instrumentation, and making a wrapper that’s always installed. Here’s a different use case that I encountered last year on a project.
I was helping optimize query patterns for a model - let’s call it Book
. An import function used a loop to import new model instances based on a unique title
field:
for title in titles:
Book.objects.get_or_create(title=title, defaults=...)
This function took several seconds as it inserted several hundred records, required for the project to work. This was particularly noticeable at the start of tests.
(Surprise, this is also a test optimization story!)
We can speed up such an import with the QuerySet.bulk_create()
method, passing ignore_conflicts=True
. This method inserts a list of instances into the database in as few queries as possible (as little as one). The ignore_conflicts=True
option skips any instance that conflicts with a unique or primary key constraint:
Book.objects.bulk_create(
[Book(title=title, ...) for title in titles],
ignore_conflicts=True,
)
This is a great solution, but unfortunately it wasn’t available. The client’s project was on Django 2.0 at the time and the ignore_conflicts
argument to bulk_create()
was added in 2.2.
Upgrading Django would be a rather large yak shave. Instead, I decided to backport the behaviour of ignore_conflicts=True
.
Normal backporting takes a lot of work:
- Fork the Django repository
- Merge the commit from the later Django version back to your version’s branch (dealing with any merge conflicts)
- Install Django from the forked repository
- Keep your fork up to date with every new minor version.
Instead of doing that, I opted to try backport the behaviour in a different way. I used database instrumentation to edit the SQL generated by the ORM to match what ignore_conflicts=True
generates. I implemented this in a context manager:
from django.db import connection
with make_bulk_create_ignore_conflicts(connection):
Book.objects.bulk_create(
[Book(title=title, ...) for title in titles],
ignore_conflicts=True,
)
The context manager was a little more involved than solely setting up the database instrumentation wrapper:
from contextlib import contextmanager
from unittest import mock
import django
if django.VERSION >= (2, 2):
raise AssertionError(
"On Django 2.2+, replace make_bulk_create_ignore_conflicts with use"
+ " of bulk_create's new argument ignore_conflicts=True"
)
@contextmanager
def make_bulk_create_ignore_conflicts(connection):
# Temporarily disable the backend feature flag to pretend to Django that
# PostgreSQL doesn't support "RETURNING". This prevents it fetching the
# ID's back, which is a bit different to normal bulk_create, but it
# generates SQL which we can easily append to
patch_features = mock.patch.object(
connection.features, "can_return_ids_from_bulk_insert", False
)
# Rewrite SQL to add the same SQL suffix on INSERT statements that Django
# 2.2+ uses for ignore_conflict=True
# https://github.com/django/django/blob/stable/2.2.x/django/db/models/sql/compiler.py#L1316
def make_inserts_conflict_do_nothing(execute, sql, params, many, context):
if sql.startswith("INSERT INTO"):
sql += " ON CONFLICT DO NOTHING"
return execute(sql, params, many, context)
apply_wrapper = connection.execute_wrapper(make_inserts_conflict_do_nothing)
with patch_features, apply_wrapper:
yield
Let’s look through the code one step at a time.
The first step is the Django version check. This is important so that after upgrading Django the project switches to the official feature. Without such checks, projects tend to accumulate workarounds. It raises an exception at import time to ensure it’s spotted early on when upgrading, regardless of test coverage.
The second step is the definition of the context manager itself. It uses the contextlib.contextmanager
decorator for simplicity. I made the name excessively long so it’s noticeable as doing something unusual.
The third step is the patching of connection.features
using unittest
’s mock.patch.object
. connection.features
is part of Django’s undocumented (but mostly stable) database backend API. The attributes on features
control SQL generation so the output is suitable for the given database. The patch temporarily disables addition of the RETURNING
clause on INSERT
queries, to allow easy addition of the ON CONFLICT
clause.
Generally, patching global objects is not thread-safe. But it’s okay here because Django creates separate database connection objects per thread.
The fourth step is the database instrumentation wrapper function, make_inserts_conflict_do_nothing()
. This wrapper inspects each query on its way to the database. If the SQL is an INSERT
query, it appends the correct " ON CONFLICT DO NOTHING"
suffix, as per the PostgreSQL INSERT documentation.
The fifth step is the with
statement that applies the patch and adds the database instrumentation temporarily. yield
then pauses the @contextmanager
function until the user of our context manager exits its with
statement.
Fin
This was a fun exercise in combining several Django features. I’ve since upgraded the client project to Django 2.2, which triggered the assertion message and allowed me to remove of this backport. But it worked well whilst it lasted.
Thanks to Tom Forbes for the initial implementation of bulk_create(ignore_conflicts=True)
back in Ticket #28668. I hope this helps you speed up any data imports, or write ORM backports,
—Adam
Learn how to make your tests run quickly in my book Speed Up Your Django Tests.
One summary email a week, no spam, I pinky promise.
Related posts:
- How to Make Always-Installed Django Database Instrumentation
- The Fast Way to Test Django transaction.on_commit() Callbacks
- Disable Instrumentation Packages during Tests
Tags: django