Optimising Performance with a Read-replica Database

At Kogan.com our product catalogue is stored in a Postgres database. To support tens of millions of products across multiple stores we've recently added a read-replica database to our AWS Aurora cluster.

Our use case for the read replica is to keep user-facing requests fast by directing less important queries away from the primary. For the most part this means tasks like sending order confirmation emails should prefer the read-only replica. These tasks generally run as Celery tasks on dedicated worker processes.

Until recently, using our read-only replica has been opt-in; developers needed to explicitly remember which database to use. It worked, but it was easy to forget, or worse, to use the read replica while processing user requests. Since the read replica is under heavier load than our write tier, using it in a request generally penalises response times.

So we wanted to make life easier and thought it’d be great if our Celery tasks just always used the read replica by default. This would mean that we’d no longer have to remember where and when to use the read replica - it would just automagically happen for us.

Django has some documentation on the topic. If you browse it you might notice the section on database routers. The use cases presented involve partitioning different models across databases: for example, a User model and a related Post model are saved together in database A, while a Product model and a related Price model are saved in database B.

To enable this kind of behaviour we need to write and configure a routing class. A routing class implements one or more of the following methods:

  • db_for_read
  • db_for_write
  • allow_relation
  • allow_migrate

The db_for_read and db_for_write methods return either a database alias to use or None if the decision isn't relevant to that router. The allow_relation and allow_migrate methods return boolean values or None.
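Routers are registered through the DATABASE_ROUTERS setting, alongside the database aliases they route between. As a rough sketch (the host names and module path here are placeholders, not our real configuration), the wiring looks something like this:

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "writer.example.internal",  # Aurora writer endpoint (placeholder)
        "NAME": "catalogue",
    },
    "readonly": {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "reader.example.internal",  # Aurora reader endpoint (placeholder)
        "NAME": "catalogue",
    },
}

DATABASE_ROUTERS = ["myapp.routers.ReadonlyRouter"]  # hypothetical module path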

Given our intent is to use the read replica for all reads and the write database for writes, a naive implementation looks like this:

class ReadonlyRouter:
    """
    Naive implementation for a database router to automatically
    direct all reads to a read-replica database.

    do NOT copy and paste this, it doesn't work!!
    """

    def db_for_read(self, model, **hints):
        return "readonly"

    def db_for_write(self, model, **hints):
        return "default"

Before you make it to the local bar to celebrate your achievement, a colleague might sensibly ask:

  • If this goes horribly wrong, how do we recover quickly?
  • What about transactions?

Ok - what about transactions then? Let’s write a small test:

from django.db import transaction
from waffle.testutils import override_switch


@override_switch("readonly_db_for_celery", active=True)
def test_worker_router_transactions_atomic(worker_db_router, transactional_db):
    """
    within a transaction _all_ db connections use the write db
    """
    with transaction.atomic():
        sp = StoreProductFactory()
        # With the naive router this read is routed to the replica, which
        # can't see the uncommitted row, and raises DoesNotExist.
        sp_default = StoreProduct.objects.get(pk=sp.pk)
        assert StoreProduct.objects.get(pk=sp.pk) == sp_default

With the naive implementation, this fails miserably. The issue? Writes aren't visible to the read replica until the transaction has committed, so we need to account for that. Luckily for us, a database connection knows when it's in a transaction, so we can check the in_atomic_block attribute on the relevant connection. We'll make it so that if we're in a transaction, all reads go to the write database.
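A minimal sketch of that check, assuming the write database uses the "default" alias:

from django.db import connections, transaction

# Outside an atomic block the flag is False...
assert connections["default"].in_atomic_block is False

with transaction.atomic():
    # ...and inside one it's True, so a router can detect the transaction.
    assert connections["default"].in_atomic_block is True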

Let's also address that kill switch. We use a package called waffle to help us manage feature flags. Encapsulating the logic in a method called _skip, we'll configure the router to do nothing when our feature flag is disabled. It ends up looking like this:

def _skip(self, model):
    if issubclass(model, waffle.models.BaseModel):
        return True
    if not waffle.switch_is_active("readonly_db_for_celery"):
        return True
    return False

We need a special case for waffle models because checking the switch queries waffle's own tables, which would call our router recursively and create a mess. With these modifications and a few more tests, let's call it a day and claim victory. But that eagle-eyed colleague notices another issue and your trip to the bar is postponed. The issue produces a stack trace containing the following:

The current database router prevents this relation.

Wait, what? We didn't even provide an allow_relation method! Our router doesn't care about this! What happened? I'm not 100% sure - what follows is my best guess. Firstly, it's not our router that's preventing it - it's just that Django sensibly decides that two objects from different databases shouldn't be linked unless a router explicitly says it's OK.

But how did we get into a position where we tried to link objects from different databases? My understanding is that newly created objects that haven't been saved are considered to have originated from the read database. So we save one object, and it behaves as though it originated from the write database; when we then try to link it to a related object that hasn't been saved, the origins don't match and things don't work out. Take that with a few grains of salt - I don't contribute to the internals of Django, nor do I claim to be an expert in object-relational mapping (ORM).
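One way to peek at this bookkeeping is the _state.db attribute Django sets on every model instance. It's an internal detail, so treat this as an illustrative sketch (assuming the router is active and we're outside any transaction):

sp = StoreProductFactory()                    # saved, so routed via db_for_write
print(sp._state.db)                           # "default"

fetched = StoreProduct.objects.get(pk=sp.pk)  # routed via db_for_read
print(fetched._state.db)                      # "readonly"

# Linking two instances whose _state.db values differ makes Django
# consult allow_relation; with no router saying yes, it refuses.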

Since we've essentially got only one 'real' database, we know that all relations should be allowed. We can alter allow_relation to say yes to pretty much everything. Our implementation now looks like this:

import waffle
import waffle.models
from django.db import connections

# Aliases matching the entries in settings.DATABASES.
DATABASE_DEFAULT_ALIAS = "default"
DATABASE_READONLY_ALIAS = "readonly"


class WorkerDbRouter:
    """
    A database router used to automatically select the readonly
    database for read operations. For use in Celery task
    processing contexts.

    This router should NOT be used in request contexts, for better
    performance. Django router docs:
    https://docs.djangoproject.com/en/3.2/topics/db/multi-db/#database-routers
    """

    def _skip(self, model):
        if issubclass(model, waffle.models.BaseModel):
            return True
        if not waffle.switch_is_active("readonly_db_for_celery"):
            return True
        return False

    def db_for_read(self, model, **hints):
        if self._skip(model):
            return None
        # Inside a transaction, reads must see our own uncommitted writes.
        if connections[DATABASE_DEFAULT_ALIAS].in_atomic_block:
            return DATABASE_DEFAULT_ALIAS
        return DATABASE_READONLY_ALIAS

    def db_for_write(self, model, **hints):
        if self._skip(model):
            return None
        return DATABASE_DEFAULT_ALIAS

    def allow_relation(self, obj1, obj2, *args, **kwargs):
        if not waffle.switch_is_active("readonly_db_for_celery"):
            return None
        return True
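One thing not shown above is how the router ends up applying only to Celery workers. A plausible arrangement (an assumption on my part, not necessarily our exact setup) is a dedicated settings module loaded only by the worker processes:

# settings/worker.py (hypothetical) - imported only by Celery workers
from .base import *  # noqa: F401,F403

DATABASE_ROUTERS = ["myapp.routers.WorkerDbRouter"]  # hypothetical path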

Now, about that beer...

Before we grab that beer, we need to acknowledge a bit of a shortcoming with the approach overall: it's now very easy to write a subtle bug. In some cases we have logging immediately after task completion that looks like:

model = DbModel.objects.get(pk=some_id)
log.info(model.interesting_method())

This example runs immediately after the object behind some_id is created in another method. A caveat of a read replica is that it isn't populated instantly - there's replication lag between a row being written to the primary and becoming available on the replica.

Because of that replication lag, reading objects that were just created is subject to a race condition: if the write hasn't replicated by the time we try to read, the code raises an exception. In cases like this we can explicitly bypass our router with the queryset's using() method. So there's still an element of 'developer beware', but the scope is a little smaller and misuse is more discoverable in our monitoring.
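The bypass looks something like this (reusing the hypothetical DbModel from above):

# Force the read to the write database, bypassing the router.
model = DbModel.objects.using(DATABASE_DEFAULT_ALIAS).get(pk=some_id)
log.info(model.interesting_method())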

With that microscopic caveat in mind, we decide that we’ve done enough testing and it’s time to reap the fruits of our labour.

We puff out our chests, ship the changes and take a deep breath as we enable the feature.

We watch with bated breath as our read-replica CPU usage climbs and our write database CPU falls - we're winning! It's amazing!

Until we look at our alert monitoring. It's... it's not ideal. We have a number of race conditions, as well as what we believe to be some issues manipulating cached objects.

It's not the end of the world - after two minutes of observation we revert the feature and go back to our less interesting routing scheme.

It seems we have some more work ahead of us.

But before then, it’s Friday and I’m overdue for a pint of lager.