The Data Engineering Testing Series

The Future Of Test Data

David O'Keeffe
Published in Cognizant Servian · 5 min read · Oct 26, 2021


Welcome to Part 5 of the Data Engineering Testing Series! Hopefully, by now, you have a relatively in-depth understanding of the concept behind this process and how it works; if you don’t, it’s okay to read on, but feel free to use the links below to learn more.

Today we’ll go through the potential of ML to revolutionize the test data management space. I may be brandishing the terms ML and revolution as buzzwords here, but in this case there’s some serious merit behind the idea. Everything is ML these days, right?

Data Testing Is Painful

For the first-timers, a quick breakdown of the whole series: use test-driven development (TDD) as a quality framework for your data engineering workflow. To do that, generate data with code and pump it through your test suite of choice.

A major sticking point of the whole process thus far has been that the coding bit (Part 2 of the series) is an annoyance most would rather forego. The thing is, I view it as a necessary evil: no one is solving the problems of test data management (TDM) overnight (again, see Part 2), and the ability to create your own data is a valuable technique for any data engineer to have. In the beginning, you can always create data that represents the “golden path” of your workflow, pump it through your pipelines, and that’s enough to scrape by.
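
To make that concrete, here’s a minimal sketch of what hand-rolled “golden path” data might look like, assuming a pandas-based pipeline tested with pytest. The orders table, its columns, and the transform_orders step are all made up for illustration:

```python
import pandas as pd

def make_golden_path_orders(n: int = 5) -> pd.DataFrame:
    """Hand-rolled 'golden path' test data: every row is well-formed."""
    return pd.DataFrame({
        "order_id": range(1, n + 1),
        "customer_id": [100 + i for i in range(n)],
        "amount": [19.99] * n,                      # no negatives, no nulls
        "currency": ["AUD"] * n,                    # a single valid currency
        "created_at": pd.date_range("2021-01-01", periods=n, freq="D"),
    })

def test_pipeline_handles_golden_path():
    raw = make_golden_path_orders()
    result = transform_orders(raw)   # hypothetical pipeline step under test
    assert len(result) == len(raw)
    assert (result["amount"] > 0).all()
```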

However, that doesn’t change the facts: it’s tedious work that requires a decent understanding of production data and a skill set perhaps few have. So anything we can do to reduce the pain of this process removes a massive blocker for the whole idea. Furthermore, the more accurate the data you use for generation, the better your development workflow, which means fewer production problems at the end of the day.

ML To The Rescue!

Here comes the revolutionary idea: what if we don’t generate data in the traditional sense using code (like in Part 3), but instead rely on a machine learning model to learn what the dataset is? The model then spits out a dataset of n rows whenever you call it. A true expression of Software 2.0, as Andrej Karpathy would put it. The advantages are obvious (there’s a quick sketch of the idea in code after the list below):

  1. Technically you don’t need to touch the production dataset anymore to do your development. A massive bonus for organizations that don’t necessarily want developers handling sensitive data.
  2. The result is considerably more realistic because a computer can pick up on the statistical properties of the data you’re using better than you can.
  3. The model maintains referential integrity among many sets of tables for you—a difficult task to undertake with bespoke tools.
  4. It is theoretically a quicker and more reproducible workflow than relying on human intuition to produce realistic data for development.
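
As a rough sketch of what that looks like in practice, here’s the single-table case with SDV (one of the tools covered below and in Part 6). This assumes SDV’s tabular API as it stood around 2021 (the library has since been reorganized), and the file names are placeholders:

```python
import pandas as pd
from sdv.tabular import GaussianCopula  # SDV single-table model (API circa 2021)

# A sample of (or stand-in for) the production table you want to mimic
real_orders = pd.read_csv("orders_sample.csv")

# "Learn what the dataset is": fit a model to its statistical shape
model = GaussianCopula()
model.fit(real_orders)

# Spit out a dataset of n rows whenever you need fresh test data
synthetic_orders = model.sample(num_rows=1000)

# Persist the model so CI and teammates can reuse it without touching production
model.save("orders_model.pkl")
```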

The drawbacks:

  1. How can you guarantee production data is not going to leak between environments? Is it possible to reverse engineer production data from the model?
  2. It could potentially require a ton of computing power, and therefore time, to produce the models. You could end up with the same stale-data problem in development, just on a slightly different and perhaps more expensive scale.
  3. The models themselves have to be maintained and verified somehow.
  4. It limits your ability to narrow the scope of your tests or create custom data. For example, you may want to test exception handling for a rare case that has never appeared in production yet. There’s no way around this as far as I can tell.

The Market Is Booming

With the pros and cons out of the way: unfortunately for me, I didn’t come up with this idea 😅. In fact, the technique has been around since 2016, but it appears to be taking off this year, with the main players in the game being:

  • The Synthetic Data Vault (“Put synthetic data to work!”), soon to be DataCebo, a clever name, that’s for sure (placebo… DataCebo… get it?)
  • Tonic.ai, “The Fake Data Company”
  • Gretel

In my ideal world, which is somewhat security conscious, my wish-list for these systems would be:

  • It absolutely cannot leak sensitive data to lower environments, so it needs options to detect PII/PHI data and obfuscate it. This would most likely need human intervention to get 100% correct.
  • It needs to have a solid authorization and permissions model to support the above as it will require human access to production data.
  • It runs on a service that is elastic, so I don’t have to worry about scaling compute or storage.
  • It can pick up relationships between tables, and even between columns in the same table, to keep the fake data consistent (see the sketch after this list).
  • It has orchestration and alerting built-in.
  • It should have an SDK (preferably in Python) to interact with.
  • The models should have metrics around their reliability and accuracy.
  • It keeps a history of the models created so we can replicate the outputs of our pipelines for debugging purposes.
  • The models should produce data quickly enough not to slow down my test suite or my development workflow significantly.
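
To make the relationship point concrete, here’s a minimal sketch of the kind of consistency check I’d want generated data to pass out of the box: every foreign key in a generated child table should resolve to a row in the generated parent table. The table and column names here are hypothetical:

```python
import pandas as pd

def find_orphans(parent: pd.DataFrame, child: pd.DataFrame,
                 parent_key: str, foreign_key: str) -> pd.DataFrame:
    """Return child rows whose foreign key has no matching parent row."""
    return child[~child[foreign_key].isin(parent[parent_key])]

# e.g. synthetic customers/orders produced by one of these tools
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 2]})

orphans = find_orphans(customers, orders, "customer_id", "customer_id")
assert orphans.empty, f"{len(orphans)} synthetic orders reference missing customers"
```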

Data Quality Testing Reigns Supreme

If any of these products gets it right, it could be a game-changer for our TDD technique. I suspect we would start to pivot towards testing that looks more like the “data quality testing” we talked about in Part 4 and less like DataFrame-to-DataFrame comparisons. This is because, in return for development speed, we have traded away the formality of rigidly defined tests (our test input is now less predictable). I’m sure most businesses are willing to accept a slightly higher error rate in production in return for a shortened time to value.
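
To illustrate the shift, here’s a minimal sketch assuming pandas and pytest-style assertions (the columns are illustrative, not taken from the series): instead of comparing the pipeline output against a hand-crafted expected frame, we assert the properties the output must hold for whatever input the model happens to generate.

```python
import pandas as pd
import pandas.testing as pdt

def assert_exact_match(actual: pd.DataFrame, expected: pd.DataFrame) -> None:
    """DataFrame-to-DataFrame style: output must equal a hand-crafted frame."""
    pdt.assert_frame_equal(actual, expected)

def assert_output_quality(actual: pd.DataFrame) -> None:
    """Data quality style (Part 4): invariants that hold for any generated input."""
    assert actual["order_id"].is_unique                           # key uniqueness
    assert actual["amount"].ge(0).all()                           # no negative amounts
    assert actual["currency"].isin({"AUD", "NZD", "USD"}).all()   # valid domain
    assert actual["created_at"].notna().all()                     # completeness
```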

Again, thanks for taking the time to get this far, and please stay tuned for Part 6, where I’ll go through SDV, Tonic.ai, and Gretel to assess their strengths and weaknesses.

As usual, you can reach out to me on LinkedIn or in the comments below.
