Complex Django filters with Subquery

Django’s Subquery expression is one of the most powerful tools in the ORM. The documentation does a good job of explaining it at a high level, but unless you have a decent understanding of SQL, you may not fully understand it’s power.

The primary use case that I’d like to cover is filtering a model based on a very nested relationship’s state. The purpose here is to show something complicated. If it doesn’t make sense, give it a little time, play around with it in your local environment and ask me follow-up questions!

Note: I’ll be using ipython and django-extensions to use python manage.py shell_plus --print-sql, then formatting that SQL manually with sqlformat.org.

Example app set up

Let’s start with our models:

class Member(models.Model):
    """A basic user/member model"""
    name = models.TextField()
    
    def __str__(self):
        return self.name

class Category(models.Model):
    """A basic categorization model for posts"""
    name = models.TextField(unique=True)

class Post(models.Model):
    """A basic post model"""
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)
    author = models.ForeignKey(Member, related_name='posts', on_delete=models.CASCADE)
    category = models.ForeignKey(Category, related_name='posts', on_delete=models.CASCADE)
    title = models.TextField()
    body = models.TextField()

class Comment(models.Model):
    """A basic comment model"""
    created = models.DateTimeField(auto_now_add=True)
    updated = models.DateTimeField(auto_now=True)
    author = models.ForeignKey(Member, related_name='comments', on_delete=models.CASCADE)
    post = models.ForeignKey(Post, related_name='comments', on_delete=models.CASCADE)
    body = models.TextField()

Now let’s create some data via a shell:

Member.objects.bulk_create([
    Member(name="alex"),
    Member(name="chris"),
    Member(name="devon"),
    Member(name="sam"),
])
alex, chris, devon, sam = list(Member.objects.order_by('name'))

Category.objects.bulk_create([
    Category(name="cats"),
    Category(name="dogs"),
])
cats, dogs = list(Category.objects.order_by('name'))

def create_post_and_comments(post_author, comment_authors, category):
    post = Post.objects.create(
        author=post_author,
        category=category,
        title="...",
        body="...",
    )
    Comment.objects.bulk_create([
        Comment(author=author, post=post, body="...")
        for author in comment_authors
    ])

create_post_and_comments(alex, [devon, sam], dogs)
create_post_and_comments(chris, [devon, sam, chris], cats)
create_post_and_comments(devon, [devon, sam], cats)
create_post_and_comments(devon, [devon], cats)
create_post_and_comments(devon, [sam], cats)
create_post_and_comments(sam, [sam], cats)

Complex filtering

Let’s find the members who have commented on cat posts that aren’t their own.

A common approach would be to look for the filters that are valid, then fetch the Members who match:

from django.db.models import F

members = set()
comments = (
    Comment.objects
    .filter(post__category__name="cats")
    .exclude(post__author=F("author"))
    .select_related('author')
)
for comment in comments:
    members.add(comment.author)
print(members)
# {<Member: devon>, <Member: sam>}

SQL generated:

SELECT "myapp_comment"."id",
       "myapp_comment"."created",
       "myapp_comment"."updated",
       "myapp_comment"."author_id",
       "myapp_comment"."post_id",
       "myapp_comment"."body",
       "myapp_member"."id",
       "myapp_member"."name"
FROM "myapp_comment"
INNER JOIN "myapp_post" ON ("myapp_comment"."post_id" = "myapp_post"."id")
INNER JOIN "myapp_category" ON ("myapp_post"."category_id" = "myapp_category"."id")
INNER JOIN "myapp_member" ON ("myapp_comment"."author_id" = "myapp_member"."id")
WHERE ("myapp_category"."name" = 'cats'
       AND NOT ("myapp_post"."author_id" = ("myapp_comment"."author_id")))

That works and for small cases it’s great. Plus the SQL is pretty straightforward. However, it’s a bit inefficient because we’re fetching ALL comments then making them unique in the application layer via a set.

We can shove this distinct logic down into the database by performing the select on Member with the following queryset:

from django.db.models import F, Q

members = Member.objects.filter(
    # Using ~Q() here allows us to do an exclude on one-to-many relationship
    # without generating another JOIN to that table in the SQL.
    ~Q(comments__post__author=F("comments__author")),
    comments__post__category__name="cats",
).distinct()
print(members)
# <QuerySet [<Member: devon>, <Member: sam>]>

SQL generated:

SELECT DISTINCT "myapp_member"."id",
                "myapp_member"."name"
FROM "myapp_member"
INNER JOIN "myapp_comment" ON ("myapp_member"."id" = "myapp_comment"."author_id")
INNER JOIN "myapp_post" ON ("myapp_comment"."post_id" = "myapp_post"."id")
INNER JOIN "myapp_category" ON ("myapp_post"."category_id" = "myapp_category"."id")
WHERE (NOT (EXISTS
              (SELECT 1 AS "a"
               FROM "myapp_comment" U1
               INNER JOIN "myapp_post" U2 ON (U1."post_id" = U2."id")
               WHERE (U2."author_id" = ("myapp_comment"."author_id")
                      AND U1."id" = ("myapp_comment"."id")
                      AND "myapp_comment"."author_id" = ("myapp_member"."id"))
               LIMIT 1))
       AND "myapp_category"."name" = 'cats')
LIMIT 21

This gets us the same result, and it’s an actual QuerySet instance rather than a set. This can be useful if we need to do some other ORM operation. However, looking at that SQL, we can see we’re selecting on "myapp_comment" twice (and there’s another join to "myapp_post"). This may not be problematic depending on your database’s resources, but it sure would be great if we could avoid that.

from django.db.models import Exists, OuterRef, Q

members = Member.objects.filter(
    Exists(
        Comment.objects.filter(
            ~Q(post__author=OuterRef("id")),
            author=OuterRef('id'),
            post__category__name="cats",
        )
    )
)
print(members)
# <QuerySet [<Member: devon>, <Member: sam>]>

SQL generated:

SELECT "myapp_member"."id",
       "myapp_member"."name"
FROM "myapp_member"
WHERE EXISTS
    (SELECT 1 AS "a"
     FROM "myapp_comment" U0
     INNER JOIN "myapp_post" U1 ON (U0."post_id" = U1."id")
     INNER JOIN "myapp_category" U4 ON (U1."category_id" = U4."id")
     WHERE (NOT (U1."author_id" = ("myapp_member"."id"))
            AND U0."author_id" = ("myapp_member"."id")
            AND U4."name" = 'cats')
     LIMIT 1)
LIMIT 21

I’m not 100% sure that’s always going to perform better, so it will require some testing and measurement. I’m showing this though because it provides you with another way to define a QuerySet that generates SQL in another way that may perform better.

More power!

Let’s turn up the gas a bit. Let’s add some models to track sharing of content and usage of those shares. This is similar to Stack Overflow where it can measure the impact of your engagement.

from uuid import uuid4

class ShareLink(models.Model):
    """
    A model to track who is sharing what and how often things are shared
    """
    created = models.DateTimeField(auto_now_add=True)
    created_by = models.ForeignKey(Member, related_name='shares', on_delete=models.CASCADE)
    uuid = models.UUIDField(unique=True, default=uuid4)
    post = models.ForeignKey(Post, related_name='shares', on_delete=models.CASCADE)
    comment = models.ForeignKey(Comment, blank=True, null=True, related_name='shares', on_delete=models.CASCADE)

class ShareLinkView(models.Model):
    """
    A model to track who is viewing shared content
    """
    created = models.DateTimeField(auto_now_add=True)
    share_link = models.ForeignKey(ShareLink, related_name='share_link_views', on_delete=models.CASCADE)
    viewer = models.ForeignKey(Member, related_name='share_link_views', on_delete=models.CASCADE)

And generate some data.

def create_share_and_views(created_by, viewers, post):
    share = ShareLink.objects.create(
        post=post,
        created_by=created_by,
    )
    ShareLinkView.objects.bulk_create([
        ShareLinkView(share_link=share, viewer=viewer)
        for viewer in viewers
    ])

alex, chris, devon, sam = list(Member.objects.order_by('name'))

for post in Post.objects.filter(comments__author=sam).distinct():
    create_share_and_views(sam, [alex, chris, chris, chris, devon, sam], post)

for post in Post.objects.filter(comments__author=chris).distinct():
    create_share_and_views(chris, [chris, chris, devon, devon, sam, sam], post)

# Share posts by devon, but only one view was by another user.
for post in Post.objects.filter(comments__author=devon).distinct():
    create_share_and_views(devon, [devon, devon, sam], post)

Ridiculously complex filtering

Our next goal is to find the members who have commented on cat posts that aren’t their own, then shared that post which was viewed two or more times by others. These members are also known as the Chatty Cats.

from django.db.models import F

members = set()
comments = (
    Comment.objects
    .filter(post__category__name="cats")
    .exclude(post__author=F("author"))
    .select_related('author')
    .prefetch_related("post__shares__share_link_views")
)
for comment in comments:
    # Check that the share was by a member other than the
    # commenter and that it was viewed 2+ times by other
    # members.
    for share in comment.post.shares.all():
        # Use created_by_id to avoid a N+1 or select_related
        if share.created_by_id != comment.author_id:
            continue
        views = len([
            view.id for view in share.share_link_views.all()
            # Use viewer_id to avoid a N+1 or select_related
            if view.viewer_id != comment.author_id
        ])
        if views >= 2:
            # Annotate the number of most views
            comment.author.most_views = views
            members.add(comment.author)
print(members)
# {<Member: sam>}
print(members.pop().most_views)
# 5

SQL generated (there’s a lot!):

SELECT "myapp_comment"."id",
       "myapp_comment"."created",
       "myapp_comment"."updated",
       "myapp_comment"."author_id",
       "myapp_comment"."post_id",
       "myapp_comment"."body",
       "myapp_member"."id",
       "myapp_member"."name"
FROM "myapp_comment"
INNER JOIN "myapp_post" ON ("myapp_comment"."post_id" = "myapp_post"."id")
INNER JOIN "myapp_category" ON ("myapp_post"."category_id" = "myapp_category"."id")
INNER JOIN "myapp_member" ON ("myapp_comment"."author_id" = "myapp_member"."id")
WHERE ("myapp_category"."name" = 'cats'
       AND NOT ("myapp_post"."author_id" = ("myapp_comment"."author_id")));


SELECT "myapp_post"."id",
       "myapp_post"."created",
       "myapp_post"."updated",
       "myapp_post"."author_id",
       "myapp_post"."category_id",
       "myapp_post"."title",
       "myapp_post"."body"
FROM "myapp_post"
WHERE "myapp_post"."id" IN (8,
                            9,
                            11);


SELECT "myapp_sharelink"."id",
       "myapp_sharelink"."created",
       "myapp_sharelink"."created_by_id",
       "myapp_sharelink"."uuid",
       "myapp_sharelink"."post_id",
       "myapp_sharelink"."comment_id"
FROM "myapp_sharelink"
WHERE "myapp_sharelink"."post_id" IN (8,
                                      9,
                                      11);


SELECT "myapp_sharelinkview"."id",
       "myapp_sharelinkview"."created",
       "myapp_sharelinkview"."share_link_id",
       "myapp_sharelinkview"."viewer_id"
FROM "myapp_sharelinkview"
WHERE "myapp_sharelinkview"."share_link_id" IN (19,
                                                23,
                                                25,
                                                20,
                                                26,
                                                21);

While we don’t have a N+1 situation, we are generating 4 SQL queries and the Python code is a bit complex. It may be possible to reduce some of this, but ultimately we want to see if we can execute one query to get the results we’re after.

The requirement of finding at 2+ views means we’re going to need to count something in an annotation.

from django.db.models import Count, F, OuterRef, Q

nested_share_qs = (
    ShareLink.objects.filter(
        # The ShareLink should be from the member
        created_by=OuterRef("id"),
        # Filter to shared posts where the sharer commented on the post.
        post__comments__author=OuterRef("id"),
    )
    # Exclude posts that were authored by the sharer/commenter
    .exclude(post__author=F("created_by"))
    # Limit the GROUP BY to only the id column rather than all columns of ShareLink
    .values('id')
    .annotate(
        # Count the number of views by people other than the sharer/commenter
        views_count=Count(
            'share_link_views',
            filter=~Q(share_link_views__viewer=F("created_by"))
        )
    )
    .order_by('-views_count')
)

members = Member.objects.annotate(
    most_views=Subquery(nested_share_qs.values('views_count')[:1])
).filter(most_views__gte=2)
print(members)
# <QuerySet [<Member: sam>]>
print(members[0].most_views)
# 5

The SQL here is too much for --print-sql, so you’ll need to run print(members.query) in the shell.

SQL generated:

SELECT "myapp_member"."id",
       "myapp_member"."name",

  (SELECT COUNT(U6."id") FILTER (
                                 WHERE NOT (U6."viewer_id" = (U0."created_by_id")
                                            AND U6."viewer_id" IS NOT NULL)) AS "views_count"
   FROM "myapp_sharelink" U0
   INNER JOIN "myapp_post" U2 ON (U0."post_id" = U2."id")
   INNER JOIN "myapp_comment" U3 ON (U2."id" = U3."post_id")
   LEFT OUTER JOIN "myapp_sharelinkview" U6 ON (U0."id" = U6."share_link_id")
   WHERE (U0."created_by_id" = ("myapp_member"."id")
          AND U3."author_id" = ("myapp_member"."id")
          AND NOT (U2."author_id" = (U0."created_by_id")))
   GROUP BY U0."id"
   ORDER BY 1 DESC
   LIMIT 1) AS "most_views"
FROM "myapp_member"
WHERE
    (SELECT COUNT(U6."id") FILTER (
                                   WHERE NOT (U6."viewer_id" = (U0."created_by_id")
                                              AND U6."viewer_id" IS NOT NULL)) AS "views_count"
     FROM "myapp_sharelink" U0
     INNER JOIN "myapp_post" U2 ON (U0."post_id" = U2."id")
     INNER JOIN "myapp_comment" U3 ON (U2."id" = U3."post_id")
     LEFT OUTER JOIN "myapp_sharelinkview" U6 ON (U0."id" = U6."share_link_id")
     WHERE (U0."created_by_id" = ("myapp_member"."id")
            AND U3."author_id" = ("myapp_member"."id")
            AND NOT (U2."author_id" = (U0."created_by_id")))
     GROUP BY U0."id"
     ORDER BY 1 DESC
     LIMIT 1) >= 2

It could be argued that this members QuerySet is more difficult to understand than the looping, prefetch logic to filter data. However, it limits the data being fetched from the database to exactly what’s needed. The looping, prefetch approach ends up selecting the full objects for all the comments’ posts, share links and share link views.

The generated SQL here looks like a lot, but this is partly because we’re annotating the Subquery in the result. Things would be dramatically simpler with the following:

members = Member.objects.filter(
    Exists(nested_share_qs.filter(views_count__gte=2))
)
print(members.query)

SQL generated

SELECT "myapp_member"."id",
       "myapp_member"."name"
FROM "myapp_member"
WHERE EXISTS
    (SELECT 1 AS "a"
     FROM "myapp_sharelink" U0
     INNER JOIN "myapp_post" U2 ON (U0."post_id" = U2."id")
     INNER JOIN "myapp_comment" U3 ON (U2."id" = U3."post_id")
     LEFT OUTER JOIN "myapp_sharelinkview" U6 ON (U0."id" = U6."share_link_id")
     WHERE (U0."created_by_id" = ("myapp_member"."id")
            AND U3."author_id" = ("myapp_member"."id")
            AND NOT (U2."author_id" = (U0."created_by_id")))
     GROUP BY U0."id"
     HAVING COUNT(U6."id") FILTER (
                                   WHERE NOT (U6."viewer_id" = (U0."created_by_id")
                                              AND U6."viewer_id" IS NOT NULL)) >= 2
     LIMIT 1)

Lessons learned

Hopefully, this gave you a better insight into what the ORM is doing when you generate QuerySets. This used several advanced ORM usages, so if you saw something you’re not familiar with, please go look it up in the docs or play around with it.

One thing I’d like to call out explicitly is the lookups on F() expressions and OuterRef() expressions. These are wildly powerful. That’s because you can use them to filter on relationships that aren’t from a related field such as ForeignKey. For example, we had this python code

.exclude(post__author=F("created_by"))

This allowed us to filter on two models that were joined together by Member, but without having to join to the member table explicitly. Understanding that concept unlocks the ability to do complex things with the ORM.

Other things we touched on:

And that’s it! Let me know what you think or if you have any questions. You can find me on the Fediverse, Django Discord server, or reach me via email.