
Why you should think about automated test data management before you design your database

John Gluck
5 min read · May 2, 2023


Save yourself money, time, SDETs and heartache

Much of my thinking on this subject, though obtained from the school of hard knocks, is echoed in this Google Cloud research paper. It’s nice to have validation.

One of the most challenging aspects of automated software validation, and particularly live system integration validation, is test data.

Before I go any further, I want to distinguish among three kinds of Test Data Management (TDM):

  1. For data pipelines — For the purposes of this post, I will not concern myself with this.
  2. For UAT — It is important to understand how data for automated tests and data for UAT differ. I will explain this further in the post I link to below, which discusses anti-patterns.
  3. For automated tests — Particularly those executed in the post-merge stage of the CD pipeline.

The problems with test data for automated tests

Let me start by emphasizing the standard test automation principle that any automated test should verify one and only one objective, transformation, intention, or mutation at a time. I don’t want to quibble about what to call it; any of those four words suffices to clarify the standard.
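To make that concrete, here is a minimal, self-contained sketch using pytest and sqlite3, with a hypothetical “activate account” mutation standing in for your application logic: the test injects exactly the state it needs, performs one mutation, and makes one verification.

```python
# A minimal sketch of the one-objective-per-test principle. The schema and
# the activate() mutation are hypothetical stand-ins for your application.

import sqlite3

def make_db():
    # Inject known state directly, rather than deriving it from other tests.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO accounts (id, status) VALUES (1, 'pending')")
    return conn

def activate(conn, account_id):
    conn.execute("UPDATE accounts SET status = 'active' WHERE id = ?",
                 (account_id,))

def test_activate_sets_status_to_active():
    conn = make_db()   # known starting state
    activate(conn, 1)  # exactly one mutation
    status = conn.execute(
        "SELECT status FROM accounts WHERE id = 1").fetchone()[0]
    assert status == "active"  # exactly one verification
```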

If your team doesn’t specifically architect or design for the eventuality that your automated tests will need to easily inject data in any state into your database, that eventuality doesn’t just disappear. You’ve merely swept it under the rug and shifted that responsibility onto whoever is validating your app.

You will undoubtedly need to manage test data at some point in your application’s lifespan. If you don’t specifically design for it, you will likely end up with a test data management design that forces your test automators to adopt anti-patterns, because by that point, truly fixing the problem will seem prohibitive in terms of time, cost, and general enthusiasm.

To circumvent these problems, your testers will likely resort to one or more of the following anti-patterns.

  • End-to-end style — Using your highest abstraction layer.
  • Dependent tests — Chaining tests to gradually mutate state (sketched after this list).
  • Regular production database snapshotting — Usually used in combination with ad hoc SQL.
  • Test data management service — Using a service as a common interface, usually as a gateway to or orchestrator for other services.
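To make the second anti-pattern concrete, here is a sketch of what chained tests tend to look like (the `api` client is a hypothetical stand-in): each test silently depends on state left behind by the previous one, so the suite cannot run in isolation, in parallel, or in any other order, and one failure cascades through the rest of the chain.

```python
# A sketch of the "dependent tests" anti-pattern; `api` is hypothetical.

order_id = None  # shared mutable state threaded between tests

def test_create_order():
    global order_id
    order_id = api.create_order(item="widget")
    assert order_id is not None

def test_pay_order():
    # Silently assumes test_create_order already ran and succeeded.
    api.pay_order(order_id)
    assert api.get_order(order_id).status == "paid"

def test_ship_order():
    # Three tests deep into accumulated state; any failure above breaks this.
    api.ship_order(order_id)
    assert api.get_order(order_id).status == "shipped"
```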

Each one of these patterns incurs both obvious and hidden costs, so in combination the costs multiply. But testers who want to do something about it are generally stopped by pessimism: the reason they have to go through so much effort to create data for tests is that the database wasn’t designed for the task. That may be because the database is monolithic, can’t easily be recreated from scratch, and/or the application wasn’t built to have its database swapped out for testing purposes.

So what’s the ideal?

As best as I can determine, having seen this work and having confirmed this pattern with other people (like Francis Upton) who have developed the same notion independently, the approach is rather simple.

  1. Follow best practice by maintaining a 1:1 relationship between services and database schemas.
  2. Implement a “minimal state” for your database, so that all your services have the data they require for operation and no more, as if the application were brand new.
  3. Design your application so that you can swap your regular database for the “minimal state” database when executing tests (see the sketch after this list).
  4. Maintain seed data that corresponds to tests and is updated appropriately when the schema changes.
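As a rough illustration of steps 2 and 3, here is a minimal sketch using pytest and SQLAlchemy against Postgres; the names (`TEST_DATABASE_URL`, `seeds/minimal_state.sql`, `myapp.create_app`) are hypothetical stand-ins for your own setup.

```python
# A minimal sketch, assuming Postgres, SQLAlchemy, and pytest; all names
# are hypothetical. The suite points the app at a throwaway minimal-state
# database instead of the regular one.

import os
import pytest
import sqlalchemy

MINIMAL_STATE_URL = os.environ.get(
    "TEST_DATABASE_URL", "postgresql://localhost/app_minimal_state"
)

@pytest.fixture(scope="session")
def minimal_db():
    """Build the minimal-state database once per test session."""
    engine = sqlalchemy.create_engine(MINIMAL_STATE_URL)
    with engine.begin() as conn:
        # Versioned seed script: only the data every service needs to boot.
        with open("seeds/minimal_state.sql") as f:
            conn.execute(sqlalchemy.text(f.read()))
    yield engine
    engine.dispose()

@pytest.fixture()
def client(minimal_db, monkeypatch):
    """Point the app at the minimal-state database, not the regular one."""
    monkeypatch.setenv("DATABASE_URL", MINIMAL_STATE_URL)
    from myapp import create_app  # hypothetical application factory
    return create_app().test_client()
```

The detail that matters here is that the swap is a configuration change rather than a code change, and that is only possible if the application was designed to allow it.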

The result will be a dearth of anti-patterns, a drastic reduction in automated test maintenance, and a corresponding gain in stability. In other words, you won’t notice that you are free of a problem you will never have.

It is difficult for me to convey how critical such a change is, and it is even more difficult to put a dollar amount on the savings. That said, it is common knowledge these days that monolithic databases come with big problems. What is less well known is that fixing your monolithic database to be less monolithic, or even just less restrictive, opens up vast cost savings.

A real-life example

At one company I worked for, I arrived just as they were fixing their database monolith. They were doing so in order to get better performance and quality from their data pipeline, but their solution also fixed test data management.

This company was a startup, and it didn’t have time to rewrite all of its applications, so it did the next best thing: it wrote a recurring process that rebuilt the database from the ground up by first removing the constraints, then adding the data, then re-adding the constraints. The process took a while to perfect, about six months if I remember correctly, but the results were astounding from a testing perspective. Instead of having to identify and manage data in the production database in order to test, testers could now simply write generators to add new seed data to the minimal set.
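I can’t reproduce their pipeline here, but the shape of the process looked roughly like the following sketch (Postgres and SQLAlchemy assumed; the table and constraint names are hypothetical): drop the constraints so seed tables can load in any order, bulk-load the data, then re-add the constraints so the whole load is validated in one shot.

```python
# A hedged sketch of a constraint-toggling rebuild; names are hypothetical.

import sqlalchemy

def rebuild_minimal_database(engine, seed_rows):
    with engine.begin() as conn:
        # 1. Remove foreign keys so load order doesn't matter.
        conn.execute(sqlalchemy.text(
            "ALTER TABLE orders DROP CONSTRAINT IF EXISTS orders_customer_fk"))
        # 2. Bulk-load the seed data.
        conn.execute(
            sqlalchemy.text(
                "INSERT INTO customers (id, name) VALUES (:id, :name)"),
            seed_rows["customers"],
        )
        conn.execute(
            sqlalchemy.text(
                "INSERT INTO orders (id, customer_id) "
                "VALUES (:id, :customer_id)"),
            seed_rows["orders"],
        )
        # 3. Re-add the constraints; if this fails, the seed data is bad.
        conn.execute(sqlalchemy.text(
            "ALTER TABLE orders ADD CONSTRAINT orders_customer_fk "
            "FOREIGN KEY (customer_id) REFERENCES customers (id)"))
```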

Take my word on this

It is difficult to measure the value of the absence of something. It is difficult to estimate the waste created by a particular design decision when trying to view that output in isolation. How does one determine which design decisions contribute to which wasteful activities and to what extent?

All I can tell you is that, after observing these patterns over years at several companies and having received independent confirmation, the approach here yields huge savings in maintenance and saves teams a great deal of time, headache, and frustration. It reduces the risk of automated testing because it makes the most difficult part of automated testing much easier to manage, and in doing so it frees up cycles for assessing risk.

Tell me more

Feel free to reach out to me and tell me about your approach to this problem, how you fixed it, and maybe even how it’s better than mine.

