Don’t Automate Test Cases

How directly automating test cases leads to unwieldy and bloated automation suites that provide little or no value.

Blake Norrish
Slalom Build

--

It is a common practice to use test cases as a backlog for test automation. QAs develop test cases from user stories as part of normal testing, then automate those tests. Each iteration, more stories are tested, more test cases are automated, and the suite of automated tests grows larger. Engineering leaders push metrics like “percentage of test cases automated” and congratulate teams when that number is high. Some teams even employ specialized “automation engineers” whose sole job is to take test cases and automate them.

Unfortunately, automating test cases and pushing percentage-of-tests-automated metrics is a quality engineering anti-pattern that inevitably leads to bloated, hard-to-maintain test suites that provide little value. While automation is critical for agile delivery, this overly simplistic “automation factory” mentality is not a healthy approach to test automation.

In this article we will show why automation factories are misguided, and describe a better approach to automation development that ensures test automation sustains and accelerates delivery velocity.

The Costs and Benefits of Test Automation

To understand why automating existing test cases is so problematic, we need to step back and review a bit of automation theory. Specifically, we need to dig into the automation costs and benefits, look at the expected value-over-time of automated tests, then look at how the expected value changes with different types of tests. We can then look at how automating test cases using a simple automation factory approach would impact overall test suites.

All automated tests have two types of cost: an initial cost to develop, and an ongoing cost to maintain. Every test is also expected to have a benefit: the difference between the time it takes to execute the test manually and the (presumably) much quicker automated check. While there are other intangible benefits (it’s more fun, it teaches valuable skills, etc.), we do not need to consider those here.

While this is a massive oversimplification of test automation, it does capture the critical aspects for our purposes — every automated test of every type has both a cost and a benefit, and both are important. As automation experts we are trying to maximize the benefits and minimize the costs.

The most oversimplified analysis of test automation you will ever see.

Here are some things that affect the cost of the test:

  • The existence (or lack) of a test framework into which the test will be added
  • The cleanliness of the existing test framework and suite
  • The ease of (and ability to) set up test state (e.g., test data)
  • The availability of an acceptable test oracle
  • The volatility of the interfaces or features the test will interact with
  • The stability of the environment against which the test will run
  • The technical skill of the QAs expected to create and maintain the automation

Here is a partial list of things that could affect the amount of benefit we derive from an automated test:

  • How often the test is expected to run (every commit, daily, per release, etc.)
  • How long the test will remain a valid check of the system under test (SUT)
  • How expensive (in time, or otherwise) it is to validate the same test manually
  • How error prone the test is to run manually

Both these lists are incomplete, and experienced quality engineers are probably screaming “But what about… xyz!!” Fortunately, an exhaustive list isn’t necessary to show that every test has a litany of factors that contribute to both the expected costs and benefits.

We’ve already established that automation costs split between upfront development and ongoing maintenance. The benefits also have a time dimension: value from an automated test isn’t realized immediately; it accumulates over the lifespan of the test. So the full accounting of a test’s value isn’t settled when the test is created; it changes over time.

If we graphed this value-over-time for a generic automated test, it would look something like this:

This graph shows a test that initially has a net-negative value: the initial development costs outweigh the benefits in time saved. However, the time saved eventually overcomes the initial cost as well as ongoing maintenance costs to provide positive value. Every test goes through some lifecycle like this based on all the variables listed above.

Depending on initial development costs, maintenance costs, and benefits derived, the test may break even and provide positive value much sooner, or possibly never break even:

The above graph describes a test that was initially providing some value (after initial development, the line slopes up), but then later stops providing value. Perhaps this test stopped being run, or was so flaky that nobody trusted it, or the feature it tested was sunset. Regardless of the reasons, the graph tells us that we would have been better off never automating the test in the first place.

This is the key point: the value of an automated test is impacted by many variables, and depending on those variables, a test can end up with either a positive or a negative lifetime value in terms of overall time saved.
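
To make that concrete, here is a minimal back-of-the-envelope sketch (in Python, with invented numbers) of the break-even arithmetic behind these graphs; the specific costs and benefits are assumptions for illustration only:

    # Rough model of an automated test's cumulative lifetime value.
    # All numbers below are hypothetical; plug in your own estimates.

    def net_value_hours(weeks,
                        initial_cost_hours=16.0,         # time to build the test
                        maintenance_hours_per_week=0.5,  # flakiness, selector churn, etc.
                        manual_time_saved_hours=0.75,    # manual effort replaced per run
                        runs_per_week=5):
        """Cumulative hours saved (positive) or lost (negative) after N weeks."""
        benefit = manual_time_saved_hours * runs_per_week * weeks
        cost = initial_cost_hours + maintenance_hours_per_week * weeks
        return benefit - cost

    for weeks in (1, 4, 12, 26):
        print(weeks, round(net_value_hours(weeks), 1))
    # With these numbers the test is underwater for roughly its first month,
    # then slowly climbs into positive territory, unless maintenance creeps
    # up or the test stops being run.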

The Many Types of Automated Tests

Next, let’s consider what the value-over-time graphs would look like for different types of automated tests. By types of tests, we mean everything from small, code-level unit tests and slightly larger “social” unit tests, to component tests, higher-level integration tests, tests that bypass the UI and hit APIs directly, tests that mock the APIs and exercise just the UI, and E2E tests that span the full tech stack.

It’s important to note that in complex, modern software there are many, many possible types of automated tests; there is an almost unlimited number of ways you can decompose the system into pieces and attempt to isolate just this piece or that piece. Each of these isolations represents a possible test type.

An exploration of every type of automated test that could conceivably be created against a non-trivial architecture is beyond the scope of this article, but I’d recommend The Practical Test Pyramid and Testing Strategies in a Microservice Architecture (both linked in the references below) as good primers. To keep this post succinct, we will only consider the two extremes: the small unit test and the large, full end-to-end (E2E) test.

What would the cost/benefit equation and the graph of the value-over-time look like for a unit test? Some unique characteristics of unit tests:

  • Unit tests can usually be written in a matter of minutes, if not seconds
  • Unit tests are (or should be) immune to external state, meaning they use mocks, doubles, stubs, etc. to deterministically control the execution path (see the sketch after this list).
  • Unit tests can be executed in milliseconds, suites of unit tests in seconds.
  • Unit tests will likely be executed thousands of times a day, not only within a CI/CD pipeline for every commit, but also locally by each developer as code is written.
  • Even with 100% coverage, unit tests cannot prove the application actually works as expected; each one validates an incredibly small (usually singular) piece of behavior.
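
To illustrate the isolation point above, here is a minimal pytest-style sketch; the ShippingCalculator class and its stubbed rate provider are hypothetical, invented purely for illustration:

    # Hypothetical production code: shipping cost depends on an external rate service.
    class ShippingCalculator:
        def __init__(self, rate_provider):
            self.rate_provider = rate_provider  # injected, so a test can stub it

        def calculate(self, weight_kg):
            return round(self.rate_provider.rate_per_kg() * weight_kg, 2)

    # The unit test: no network, no database, no deployed environment.
    class StubRateProvider:
        def rate_per_kg(self):
            return 4.50  # deterministic value controlled by the test

    def test_calculate_uses_provider_rate():
        calculator = ShippingCalculator(StubRateProvider())
        assert calculator.calculate(2) == 9.00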

Given this assessment, the value-over-time graph of a unit test is probably very different from our generic graph: it has almost no up-front cost, minimal maintenance, and while it is executed all the time, each incremental execution actually provides only a very small amount of value.

The graph would probably look something like:

What about the largest of all automated tests, the E2E? What would the value-over-time graph look like for it?

Some key points about E2E tests:

  • E2E tests (by definition) are impacted by the most state and thus require the most setup and test data control.
  • E2E tests are executed against a full environment. Often parts of this environment are shared.
  • E2E tests often include many (dozens, even hundreds) of serial steps.
  • E2E tests are the slowest of all tests, possibly running for minutes.
  • E2E tests usually have to drive functionality through a user interface.
  • E2E tests are usually executed much later in a CI/CD pipeline.
  • E2E tests are the only type of automated tests that demonstrate the application works as the customer would use it.
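
For a sense of scale, here is what even a heavily trimmed browser-driven E2E check might look like, sketched with Playwright for Python; the URLs, selectors, and credentials are all hypothetical:

    # A deliberately small E2E sketch; real suites chain many more steps.
    from playwright.sync_api import sync_playwright

    def test_user_can_check_out():
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Every step below depends on a fully deployed, correctly seeded environment.
            page.goto("https://staging.example.com/login")
            page.fill("#email", "qa-user@example.com")
            page.fill("#password", "not-a-real-password")
            page.click("text=Sign in")
            page.click("text=Add to cart")
            page.click("text=Checkout")
            page.fill("#card-number", "4242424242424242")
            page.click("text=Place order")
            assert page.inner_text(".order-confirmation") != ""
            browser.close()

Every line is a point of coupling to a live environment, which is exactly why both the up-front and the ongoing costs are so much higher.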

Given these characteristics, the value-over-time graph of an E2E test would look something like this:

This graph shows a significantly higher up-front cost, initially sending the net value highly negative. However, the continued execution of this test over time eventually allows it to break even, then proceed into positive value.

Again, the expectation that this test will eventually provide positive value is predicated on the factors listed earlier: how long the test will be used, how often it will be run, how much confidence we have in its results, how much it will have to change when underlying interfaces (like the UI) change, how stable the environment we execute against is, and so on. Reaching break-even is never guaranteed.

Where to Automate

The nature of E2E tests makes them costly to create and costly to maintain. They necessarily rely on (or could be impacted by) the most state across the most systems. They are more prone to system timing, synchronization, network, or external dependency issues. They usually drive some or all functionality through a web browser, an interface designed to be consumed by a human, not software. Because they are executed against a full environment, it is more likely that these tests will have to share part or all of this environment with other tests or users, possibly leading to collisions and unexpected results.

All these reasons (and many more) make large tests the riskiest, and they are only justified by the correspondingly large value they can provide: only E2E tests demonstrate the full, integrated system working together in a realistic manner.

There are many other types of tests to consider, and within complex systems, it’s highly likely that a lower-level type of test could more directly test the functionality in question without incurring the costs associated with higher-level (larger) tests.

In other words: do not build an API test against a deployed service instance to validate a piece of logic that could be validated in a unit test. Do not build an E2E test for logic that could be validated by directly calling a single API. Never bring in more of the system than you need in order to demonstrate that something is working. Identify the logic or behavior you need to test and create a test that isolates exactly that behavior. Only use higher-level tests to test the actual integration of things, not the logic within those things.

The huge, overarching point is that E2E tests are risky and are rarely the best type of automation to test specific functionality. To put this in terms of the value-over-time graph: always prefer test types that give you the most immediate expected value with the least cost, and be skeptical of large automation efforts only justified by overly optimistic estimations of eventual time-saving benefits.

This type of cost-benefit analysis is exactly the thinking that led to the Automated Test Pyramid concept many years ago. The pyramid advocates that, all things being equal, you generally want many more small, fast, cheap tests and far fewer large, slow, expensive tests.

Many types of tests. The actual names used for each test type vary significantly between organizations.

While I won’t try to convince you that the shape of your suite must always and exactly be a pyramid (Kent Dodds says it’s a Trophy, James Bach prefers a Round Earth model, Justin Searls says it’s just a distraction), I hope I did convince you that all test automation carries risk, and ensuring tests are created at the appropriate level helps mitigate and control this risk. A huge part of the automator’s job (in collaboration with the rest of the team!) is determining exactly which type of test is appropriate and gives the highest likelihood of lifetime positive value.

Said in a different way: all automation should be treated as an investment, specifically, a risky investment. Each type of automated test represents a different type of risk, and we need to manage overall risk and maximize the value of our investment by continually evaluating the cost and benefits of each type.

Automation Factory and the Top Heavy Suite

Back to the automation factory!

If we think about user stories, acceptance criteria, and the test cases that a quality assurance professional would create from them, what type of automated test do you think these test cases naturally map to? If we just blindly take test cases and automate them, what type of test would typically be created?

With generally accepted Agile documentation techniques, requirements are communicated to testers in higher-level, user-centric language (often we even call them user stories). Think “As a... I want... so that...” user stories and “Given-When-Then” acceptance criteria.

A very hypothetical Agile user story… what type of tests would be created from this?

Even when stories are sliced horizontally and describe the behavior of a specific component (e.g., the API of a REST service), the requirements will be communicated in the “user language” of that component. Thus, the test cases created from this documentation naturally map to the larger, more risky types of test automation.

This is the root problem of the automation factory approach — automating test cases inevitably over-emphasizes large, slow, and expensive tests, because test cases are naturally written in the language of the manual tester. They map to the exact type of test that our value-over-time analysis told us to avoid!

All new functionality is tested with dev unit tests and new E2E tests, leading to a top-heavy and hard to maintain test suite!

A second, and sometimes just as powerful, driver pushing automators to incorrectly prefer large E2E tests over smaller tests is that it is psychologically reassuring for non-technical people (and some technical people) to know that test cases have been automated. For example, business leaders can understand test cases because they describe application functionality in language they are familiar with, and knowing that these cases are being checked automatically reassures them that they won’t be getting 2 a.m. calls from angry customers. They get far less reassurance from knowing that the dev team has 90%+ unit coverage.

Thus, some people will push percent-of-tests-automated and number-of-tests-automated metrics because those metrics make them feel more comfortable with the state of testing, not because automating every test case is actually a more effective or efficient way of checking system behavior. Pushing every test case to be automated might make you feel safe, but it will not create a healthy automation suite.

Symptoms of the Automation Factory

The test-cases-in, automated-test-cases-out factory approach tends to lead to some very problematic, but unfortunately common, symptoms:

  • Test suites that take many hours to execute, or that can only be run overnight
  • Automation teams whose sole purpose is to investigate and triage the failures of the previous execution, which consumes most of their time
  • Test failures where the accepted mitigation is simply to rerun the test until it works
  • The removal of the suite from the CI/CD pipeline, or demoting it to a non-blocking step
  • Test suites that developers avoid or outright refuse to run because they don’t trust the results
  • Suites with thousands of tests, spread across hundreds of folders (or even different repos!), with duplicated tests, commented-out tests, and tests whose purpose and origin nobody knows
  • Herculean efforts by automation engineers to manage the bloated suite of tests, or to hide the complexity behind a layer of Gherkin (e.g., Cucumber)

All of these symptoms are indicative of a test suite that is not providing value to the team that owns it, which is unfortunately common within development organizations. The suffering teams commendably prioritized test automation within their development process, but approached it with a naïve automation factory mentality.

Healthy Automation

Ok, so how should you approach test automation to avoid the bloated suite?

The need for automation covering new functionality should be evaluated holistically by looking at automation options across every type of test. You absolutely do not want to assume that new functionality necessarily needs a new highest-level automated E2E test. Instead, evaluate how the functionality could most effectively and efficiently be covered over the full set of all test types.

You don’t want to automate test cases, you want to automate functionality. Functionality can be partially described by test cases, but automating every discrete test case into its own automated test, at the level the manual tester would perform it, will never be effective or efficient.

In fact, the most effective way to test new functionality might simply be to update an existing automated test, move the test into a more appropriate test type, or even create an entirely new test type. Don’t forget that as system functionality changes, you should be looking to delete tests as much as you are looking to add them!

Don’t assume new functionality needs new E2E tests; consider all types of automation!

While the type of test you need will be highly dependent on your system architecture, acceptable risk profile, existing tooling, etc., a generally healthy approach to automation changes would look something like this:

  1. Add new tests at the lowest level possible.
  2. Update existing tests to cover new functionality.
  3. Remove any now-obsolete or redundant tests, or combine tests.
  4. Test the general case at a high level, then move permutations of that test into smaller, lower-level types of test (see the sketch after this list).
  5. Add new, high level E2E tests only if absolutely necessary.
  6. Introduce a new type of automated test if this functionality cannot be covered by any existing type of test.
  7. Modify the system architecture to enable a new type of automated test.
  8. Continually and critically evaluate the health of the full suite of all test types with the full development team.
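
As an illustration of point 4, a single high-level test can cover the general checkout flow while the pricing permutations move down into fast, parametrized unit tests; the apply_discount function below is hypothetical:

    import pytest

    # Hypothetical pricing rule, pulled out of the E2E flow and tested directly.
    def apply_discount(subtotal, customer_tier):
        rates = {"standard": 0.0, "silver": 0.05, "gold": 0.10}
        return round(subtotal * (1 - rates.get(customer_tier, 0.0)), 2)

    # The permutations live here, at the cheapest possible level...
    @pytest.mark.parametrize("subtotal, tier, expected", [
        (100.00, "standard", 100.00),
        (100.00, "silver", 95.00),
        (100.00, "gold", 90.00),
        (100.00, "platinum", 100.00),  # unknown tier falls back to no discount
    ])
    def test_apply_discount(subtotal, tier, expected):
        assert apply_discount(subtotal, tier) == expected

    # ...while a single high-level test (not shown) covers the general case:
    # one representative customer completing one representative checkout.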

Point 7 above deserves special attention, as it highlights a critical difference between healthy automation approaches and automation factories.

As development teams, we must stop thinking about test automation as something that is applied to software only after that software has been built. Instead, automation should be considered a critical part of the software development process itself. Automation must be grown with software and the requirement for automatability should drive system design and architecture just as much as any other design requirement.

Approaching automation in this way will enable smaller, more economical types of test automation unavailable to systems built without consideration for automatability. Healthy architecture is designed to be tested, and the role of the automator is just as much to inform system designers of these requirements as it is to write automation after the system has been built.
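
As one concrete (and entirely hypothetical) illustration: if an order service is designed so that outbound email goes through an injected sender rather than directly to an SMTP server, a fast, deterministic component-level test becomes possible where previously only a slow E2E test could verify the behavior:

    from dataclasses import dataclass, field

    # The architectural seam: OrderService depends on "a sender", not on a mail
    # server. In production a real SMTP-backed sender is injected; in tests, a
    # fake. (All names here are hypothetical.)
    @dataclass
    class RecordingSender:
        sent: list = field(default_factory=list)

        def send(self, to, subject):
            self.sent.append((to, subject))

    class OrderService:
        def __init__(self, sender):
            self.sender = sender

        def place_order(self, customer_email):
            # ...persist the order, charge payment, etc. ...
            self.sender.send(customer_email, "Order confirmation")

    def test_placing_an_order_sends_a_confirmation():
        sender = RecordingSender()
        OrderService(sender).place_order("customer@example.com")
        assert ("customer@example.com", "Order confirmation") in sender.sent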

Understanding that all automation has a cost and that those costs create risk; that functionality should be tested with many different types of automated tests; that the challenge of test automation is leveraging those types appropriately to create the most effective and efficient overall suite; and that automatability is as important in system design as any other requirement — this is the healthy approach to test automation. Creating automation factories and blindly automating every test case as a new top-level automated test is not.

References:
The internet has a ton of great material on healthy test distribution, types of tests, test investment, and many of the other subjects discussed in this article. Unfortunately, it’s buried in a lot of fluff, uninformed speculation, and marketing material.

Here are some of my favorite resources on these topics:

The Practical Test Pyramid, Ham Vocke
https://martinfowler.com/articles/practical-test-pyramid.html

Testing Strategies in a Microservice Architecture, Toby Clemson
https://www.martinfowler.com/articles/microservice-testing/

The Diverse and Fantastical Shapes of Testing, Martin Fowler
https://martinfowler.com/articles/2021-test-shapes.html

Write Tests. Not too Many. Mostly Integration, Kent C. Dodds
https://kentcdodds.com/blog/write-tests

The Testing Trophy and Testing Classifications, Kent C. Dodds
https://kentcdodds.com/blog/the-testing-trophy-and-testing-classifications

Testing Pyramid Ice-Cream Cones, Alister Scott
https://watirmelon.blog/testing-pyramids/

Round Earth Test Strategy, James Bach
https://www.satisfice.com/blog/archives/4947

Just Say No to more End-to-End Tests, Mike Wacker
https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html

Testing vs Checking, Michael Bolton and James Bach
http://www.satisfice.com/blog/archives/856

The Regression Death Spiral, Blake Norrish (yes, me)
https://medium.com/slalom-build/the-regression-death-spiral-18f88b9fb030

Test Cases are not Testing, James Bach and Aaron Hodder
https://www.satisfice.com/download/test-cases-are-not-testing

The Oracle Problem in Software Testing: A Survey, IEEE Transactions on Software Engineering
https://discovery.ucl.ac.uk/id/eprint/1471263/1/06963470.pdf

Blake Norrish
Slalom Build

Quality Engineer, Software Developer, Consultant, Pessimist — Currently Sr Director of Quality Engineering at Slalom Build.