{# Blog title goes here #}

Django: Getting a full model instance from a Subquery

This one might be a bit niche, but it took me a bit of trial-and-error and I'm hoping that the technique I describe might be generic enough to be reusable. Who knows, if there's enough interest I could even turn it into a library!

EDIT: I've received some positive feedback on Mastodon, so I took some time to package things up and put them on github: django-model-subquery. I'm still not 100% sure about the name, if you've got good ideas let me know (bonus points for puns).

Context

For a bit of context, my current $DAY_JOB involves a lot of converting hand-written SQL into an equivalent (or near enough) Django ORM version. In particular, I've been building a lot of Subquery annotations for querysets and I've come up with a useful technique that I wanted to share.

If you've ever used the ORM's Subquery you might know that it's limited to returning a single column, but I've figured out a technique that lets me get full model instances out of a single subquery. In order for me to explain I think it's easier if I start small and build things up. So first things first, let's start with two convenient and somewhat realistic models:

from datetime import date

from django.db import models

class Book(models.Model):
    title = models.CharField(max_length=100)
    rating = models.IntegerField(blank=True, null=True)
    published = models.DateField(default=date.today)


class Person(models.Model):
    name = models.CharField(max_length=100)
    birth = models.DateField(default=date.today)

For the sake of this example, let's say I want to get a queryset of Person instances, and for every person I want to get the most popular book published the year they were born. I know, I know, it's a little contrived but that's the best I could come up with, so please bear with me.

Attempt #1: Using Subquery

For our first attempt, let's start by creating a custom queryset class for our Person model, with a custom with_book() method that annotates the queryset with the information we want:

from django.db.models.functions import ExtractYear

class PersonQuerySet(models.QuerySet):
    def with_book(self):
        """
        Annotate each person in the queryset with the best rated book of the
        year they were born.
        """
        year = ExtractYear(models.OuterRef("birth"))
        all_books = Book.objects.filter(published__year=year).order_by("-rating")
        best_book = models.Subquery(all_books.values("title")[:1])
        return self.annotate(book=best_book)


class Person(models.Model):
    ...  # same fields as before (name and birth)
    objects = PersonQuerySet.as_manager()

Now when we do Person.objects.with_book() each instance in the resulting queryset will have a book attribute containing the title of the best rated book published the year that person was born (or None if no books were published that year). That's a pretty standard use of Subquery so far, and it works great. But what if we wanted to get both the book's title AND its rating for example? If you're like me, you might try to simply add the extra field to the subquery, like so:

best_book = Subquery(all_books.values("title", "rating")[:1])

But unfortunately, that doesn't work. Django's Subquery can only deal with a single column as far as I can tell.

Attempt #2: JSON to the rescue

This is where I had my first brilliant devious idea. How do you turn two values into a single one? You stuff them both into a JSON object of course!

from django.db.models.functions import ExtractYear, JSONObject

class PersonQuerySet(models.QuerySet):
    def with_book(self):
        """
        Annotate each person in the queryset with the best rated book of the
        year they were born.
        """
        year = ExtractYear(models.OuterRef("birth"))
        all_books = Book.objects.filter(published__year=year).order_by("-rating")
        json_obj = JSONObject(title="title", rating="rating")
        best_book = models.Subquery(all_books.values_list(json_obj)[:1])
        return self.annotate(book=best_book)

Now all my Person instances are annotated with a book attribute that contains a dictionnary with two keys: title and rating. Amazing! But wait I hear you say, why stop only at those two fields? You can easily stuff everything in there, can't you?

You sure can:

json_obj = models.JSONObject(id="id", title="title", rating="rating", published="published")

Works like a charm. Well for some definition of "charm" anyways. JSON only supports a limited number of types and in particular it does not do dates so our published field is returned as a string. That's annoying I guess, but not the end of the world right? Turns out that's fixable but it takes a bit of work, I'll get to it a bit later.

For now, let's fix something that's easy. If you're like me you might have found it annoying having to list all the fields twice. There's gotta be a better way, and indeed that's where the Model._meta API comes to the rescue:

json_obj = JSONObject(**{f.name: f.name for f in Book._meta.get_fields()})

This has all the same limitations as before, but at least it's shorter and it's easily adatable to another model. But what if I told you we could do better? I did after all promise actual model instances in the intro, not just dictionaries.

Attempt #3: JSON + custom output field

I often think of the ORM (and various parts of Django actually) in terms of layers. One layer is my database with its tables, columns, and rows. At the other end there's the Django model layer with Python objects and attributes with rich data types. The trick then becomes finding the right class and the right methods that sits between those layers and that lets me convert things from one layer to another.

With the ORM, this class is often the models.Field and its various subclasses. They're the ones who "know" how to convert from a database column type into an appropriate Python type. And if you've spent any time with Django's database functions you've probably had to sprinkle some output_field=SomeDjangoModelField() to get things working.

And it turns out that output_field is the answer to our current problem as well. The idea is to create a custom field class that can convert the JSON object returned by the database (containing a key/value pair for each field of the model) into a model instance:

class BookJSONModelField(models.JSONField):
    """
    Instantiate an actual Book instance from a JSON object containing its fields.
    """
    def from_db_value(self, value, expression, connection):
        value = super().from_db_value(value, expression, connection)
        # at this point, value is a dict with one key/value pair per field
        return Book(**value)

With this new field, all we need to do is plug it in the Subquery call like so:

best_book = models.Subquery(all_books.values_list(json_obj)[:1], output_field=BookJSONModelField())

And voilà, now our book annotation on the queryset gives actual model instances! Neat, right? There are still a few limitations though. The big one is that we still haven't solved the issue with JSON having a limited number of types: even though we do get a Book instance, if we look at book.published then we see it's still a string and not a date object.

Luckily that's not too hard to solve. Remember when I said fields were the one that could convert between data types? Turns out all fields have a nifty to_python() method that will do the conversion for us:

def from_db_value(self, value, expression, connection):
    value = super().from_db_value(value, expression, connection)
    # at this point, value is a dict with one key/value pair per field
    return Book(**{k: Book._meta.get_field(k).to_python(v) for k, v in value.items()})

So there we have it. Just like I promised: a Subquery that returns actual model instances. I hope you found the journey interesting and that you learned something from it.

Possible further improvements

This technique is of course not perfect, and I'm not even sure it's very useful outside of some very specific situations (though if you do end up using it please do let me know).

One feature that could be pretty neat would be to support ForeignKey fields via some kind of recursive JSON structure. I haven't tried to implement that but I think it should be doable without too much trouble (famous last words)

It could also be cool to not have to load all the fields in the JSON object, and mark missing fields as deferred so they're loaded automatically from the db when accessed on the instance. I've actually implemented a proof of concept of this one and it was suprisingly easy (I've left it as an exercise to the reader 😃).

What do you think? Is that something that would be worth packaging up into a 3rd-party library?


For ease of copy-pasting, here's how it looks like all put together:

from datetime import date

from django.db import models
from django.db.models.functions import ExtractYear, JSONObject


class BookJSONModelField(models.JSONField):
    """
    Instantiate an actual Book instance from a JSON object containing its fields.
    """
    def from_db_value(self, value, expression, connection):
        value = super().from_db_value(value, expression, connection)
        # at this point, value is a dict with one key/value pair per field
        return Book(**{k: Book._meta.get_field(k).to_python(v) for k, v in value.items()})


class PersonQuerySet(models.QuerySet):
    def with_book(self):
        """
        Annotate each person in the queryset with the best rated book of the
        year they were born.
        """
        year = ExtractYear(models.OuterRef("birth"))
        all_books = Book.objects.filter(published__year=year).order_by("-rating")
        json_obj = JSONObject(**{f.name: f.name for f in Book._meta.get_fields()})
        best_book = models.Subquery(all_books.values_list(json_obj)[:1], output_field=BookJSONModelField())
        return self.annotate(book=best_book)


class Book(models.Model):
    title = models.CharField(max_length=100)
    rating = models.IntegerField(blank=True, null=True)
    published = models.DateField(default=date.today)


class Person(models.Model):
    name = models.CharField(max_length=100)
    birth = models.DateField(default=date.today)

    objects = PersonQuerySet.as_manager()

(the above snippet is in the public domain, feel free to adapt/reuse it)