
Automated test data management smells and anti-patterns

John Gluck
5 min read · Apr 12, 2023


I have outlined some common approaches to test data management for automated tests, mostly so that I can refer to this outline in other blog entries.

Note: None of these approaches is mutually exclusive. The presence of these patterns in your tests demonstrates that testers don't have, or aren't aware of, better ways to maintain data for automated tests, and that they are likely covering up tech debt or dark debt.

End-to-end style

Simply put, you use the front end, or the centralized API mechanism (such as a gateway), to create your test data. On its face, this approach seems the most sensible, expedient, and reliable way to create data for tests: you simply mimic what the application does to create data in production. What could go wrong?
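To make this concrete, here is a minimal sketch of UI-driven data creation using Playwright's Python API. The URL, selectors, and credentials are hypothetical stand-ins for whatever your application actually exposes.

```python
# A minimal sketch of creating test data through the front end.
# All URLs, selectors, and credentials below are hypothetical.
from playwright.sync_api import sync_playwright

def create_customer_via_ui(name: str, email: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Every data-setup run pays the full cost of the front end:
        # login, redirects, feature flags, A/B variants, and so on.
        page.goto("https://app.example.com/login")
        page.fill("#username", "test-admin")
        page.fill("#password", "not-a-real-password")
        page.click("text=Sign in")
        page.goto("https://app.example.com/customers/new")
        page.fill("#name", name)
        page.fill("#email", email)
        page.click("text=Save")
        browser.close()
```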

The flaw in this thinking is that most companies do not have effective ways of isolating applications and data in pre-production environments, so developers and testers step on each other's toes all the time, resulting in a lot of waste, probably much more than you think. This approach also limits you to the current version and any associated feature flags, assuming you have a good system for controlling them.

Drawbacks

  1. Particularly if you are using the UI to create test data, you will be subject to any instability in the UI, such as A/B tests, feature flags, improperly handled deployments, or misbehaving dependencies.
  2. Furthermore, you will be subject to unnecessary gatekeeping: login, MFA, SSO, cookie confirmation, CSRF protection, and so on. You shouldn't have to exercise that functionality unless you are explicitly verifying it.
  3. You can only create data that complies with the business rules. If you needed to do backward-compatibility testing and knew the business rules had changed, this approach wouldn't work; you would likely need a way to configure the system to use older versions, or resort to directly modifying data to resemble the data shapes from the target version.

Dependent Tests

The main problem with using one test to set up data for a second test (sketched below) is that the person doing this is probably trying to save time. That makes sense for local development or execution. However, one of the best ways to save time when running in CI is to parallelize, and most test execution frameworks have standard built-in methods for parallelizing tests that explicitly warn against dependent tests, precisely because dependencies between tests prevent parallelization.
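Here is a minimal pytest sketch of the anti-pattern. The create_order and update_order helpers are hypothetical stand-ins for calls to your application.

```python
# A minimal sketch of dependent tests sharing module-level state.
_order_id = None  # mutated by the first test, read by the second

def create_order(sku: str, qty: int) -> int:
    # Hypothetical stand-in: a real version would call your application.
    return 42

def update_order(order_id: int, qty: int) -> str:
    # Hypothetical stand-in: fails loudly if the order was never created.
    assert order_id is not None, "order was never created"
    return "updated"

def test_create_order():
    global _order_id
    _order_id = create_order(sku="ABC-123", qty=2)
    assert _order_id is not None

def test_update_order():
    # Works only if test_create_order already ran in this same process.
    # Under pytest-xdist, the two tests can land in different worker
    # processes, and _order_id will still be None here.
    assert update_order(_order_id, qty=5) == "updated"
```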

Drawbacks

  1. Cascading errors: if one test fails, all tests that depend on it will likely fail too.
  2. Challenging, if not impossible, to parallelize.

Regular Database Snapshot (usually in combination with ad hoc SQL)

The idea here is that you take a regular snapshot of production data, sanitize/mask it, and then UPSERT that data into lower environments. It's a heavy approach that requires coordination between testers, application developers, release engineers, DevOps, and the DB team. It isn't a terrible solution, just a little suboptimal.
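As a rough illustration, here is a minimal sketch of the mask-and-upsert step, assuming Postgres and psycopg2. The connection strings, table, and columns are hypothetical, and a real pipeline also has to preserve referential integrity across every masked table.

```python
# A minimal sketch of masking a snapshot and upserting it downstream.
import psycopg2

# Hypothetical connection strings; the source is a snapshot copy, not prod.
src = psycopg2.connect("dbname=prod_snapshot host=etl-host")
dst = psycopg2.connect("dbname=staging host=staging-db")

with src.cursor() as read, dst.cursor() as write:
    read.execute("SELECT id FROM customers")
    for (cust_id,) in read:
        write.execute(
            """
            INSERT INTO customers (id, name, email)
            VALUES (%s, %s, %s)
            ON CONFLICT (id) DO UPDATE
              SET name = EXCLUDED.name, email = EXCLUDED.email
            """,
            # Mask PII on the way in. A real masker must keep values
            # format-valid and deterministic so related rows still join.
            (cust_id, f"customer-{cust_id}", f"user{cust_id}@masked.example"),
        )

dst.commit()
src.close()
dst.close()
```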

Drawbacks

  1. Requires upfront investment and several cycles of guidance and tweaking to make this automatic. That time could be spent on a better solution.
  2. Encourages bad habits: if you know a particular data shape will appear fresh in the pre-prod DB every two weeks, you don't have to spend time optimizing your tests. However, if that DB refresh is called off for any reason, your tests might get slower due to suboptimal SQL. And what happens if the refresh is postponed several times in a row? You may run out of data entirely, and your query will time out.
  3. Specifically when using ad hoc SQL, unless you are careful about how your test harness manages database connections, you can easily cause performance problems in your lower environments through poor connection pool management (a sketch follows this list).
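One way to keep ad hoc SQL from exhausting connections is to centralize them in the harness. Here is a minimal sketch using SQLAlchemy with a single bounded pool; the DSN and pool settings are hypothetical.

```python
# A minimal sketch of a single bounded connection pool for the harness.
from sqlalchemy import create_engine, text

# One engine per test run; SQLAlchemy pools and reuses its connections.
engine = create_engine(
    "postgresql://tester@staging-db/app",  # hypothetical DSN
    pool_size=5,         # cap concurrent connections to the shared env
    max_overflow=0,      # fail fast rather than piling on more connections
    pool_pre_ping=True,  # discard stale connections before handing them out
)

def run_setup_sql(statement: str, **params) -> None:
    # Borrow a pooled connection, commit on success, return it promptly.
    with engine.begin() as conn:
        conn.execute(text(statement), params)
```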

Test Data Management Service

Of all the approaches, this is my preferred anti-pattern and, in the absence of better alternatives, I have used this approach to good effect.

The idea is to create a set of endpoints that use various specifiable strategies to encapsulate the insertion of specific data shapes that your test cares about.

The premise is that, ideally, you should be able to put the data into the shape you need during the Arrange phase, so that you can move immediately to mutation in the Act phase, instead of being forced to comply with business rules (some of which may be asynchronous or even timestamp-constrained) or technical rules (such as rate limiting).
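To illustrate, here is a minimal sketch of one such endpoint using FastAPI. The route, the request shape, the strategy names, and the insert_overdue_order helper are all hypothetical.

```python
# A minimal sketch of a TDMS endpoint that inserts a finished data shape.
from datetime import datetime, timedelta

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class OverdueOrderRequest(BaseModel):
    customer_id: int
    days_overdue: int = 30
    strategy: str = "direct_sql"  # or "service_api", "fixture_clone", ...

def insert_overdue_order(customer_id: int, due_at: datetime, strategy: str) -> int:
    # Hypothetical persistence helper: a real one would write the finished
    # rows via the chosen strategy, bypassing checkout business rules.
    return 42

@app.post("/test-data/overdue-order")
def create_overdue_order(req: OverdueOrderRequest) -> dict:
    # Backdate the due date so the order is already overdue at insert time,
    # instead of waiting out the real aging rules during the test.
    due_at = datetime.utcnow() - timedelta(days=req.days_overdue)
    order_id = insert_overdue_order(req.customer_id, due_at, req.strategy)
    return {"order_id": order_id}
```

A test can then call this endpoint in its Arrange step and go straight to exercising the overdue-order behavior in the Act step.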

Gotchas

  1. There will be a temptation to treat data coming out of this system as usable for product-owner or developer testing, to save them time. However, automated tests have requirements that conflict with those of product owners and developers regarding the fidelity of data to production: developers and, to a lesser extent, product owners require fidelity to production data, whereas automated tests need to be free from fidelity constraints in order to save time and circumvent business rules. If you give in to the temptation and modify your service to accommodate both, you will likely begin compromising your service's reliability and/or ease of use.
  2. It may be possible to use your organization's existing services to create the data you need in the state you need it, or to modify the freshly created data with ad hoc SQL or a persistence API. Depending on your application and/or DB architecture, it may even be possible to inject data directly with SQL into the state your test needs. You will likely need to spend time finding the fastest and most reliable way to do this, and the answer will likely vary from harness to harness.

Drawbacks

  1. Designing, implementing, and maintaining a service is a lot of work, and over time it adds up. If you have not built a TDMS before, you may have to fail a couple of times before you get a good one. That time would likely be better spent on a more optimal solution. While this is not a complicated problem, it's more complicated than it looks.
  2. Ideally, this service would be maintained by the testing team at large rather than by any one team (such as the Testing Center of Excellence, or whatever your company calls that team if it has one). The reason is that testers are closer to the tests and therefore more likely to understand when a contract change will impact the service or when new features will require updates to it. But maintaining a service across testing organizations requires coordination, and in highly siloed organizations, maintenance of this service can become a hot potato.

Please let me know if you can think of any more test data management anti-patterns; I'd be happy to include them in this list.
