
Test code should rarely be resilient

Fail fast. Faster! Faster!!

John Gluck
4 min read · Aug 3, 2023


One of the key differences between web application code and the automated test project code that accompanies it is that application code typically executes as a service whereas test project code executes as a script. This is significant because it implies a few structural differences.

All automated test code is structured like this (sketched in code right after the list):

  1. Setup — Configure initial state which may be common to several tests
  2. Arrange — Get to where you need to go to start modifying state
  3. Act — Modify state
  4. Assert — Did the state change as expected?
  5. Teardown — Return to the state before setup
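
Here is a minimal sketch of those five stages in a single web test, assuming a Playwright-style framework; the URLs, selectors, credentials, and teardown endpoint are placeholders, not a real application.

```typescript
import { test, expect } from '@playwright/test';

test.describe('cart', () => {
  // Setup: configure state that several tests share (here, a logged-in session).
  test.beforeEach(async ({ page }) => {
    await page.goto('https://example.test/login');
    await page.getByLabel('Email').fill('qa@example.test');
    await page.getByLabel('Password').fill('secret');
    await page.getByRole('button', { name: 'Log in' }).click();
  });

  test('adding an item updates the cart count', async ({ page }) => {
    // Arrange: get to where the state change will happen.
    await page.goto('https://example.test/products/42');

    // Act: modify state.
    await page.getByRole('button', { name: 'Add to cart' }).click();

    // Assert: did the state change as expected?
    await expect(page.getByTestId('cart-count')).toHaveText('1');
  });

  // Teardown: return to the state that existed before setup.
  test.afterEach(async ({ request }) => {
    await request.delete('https://example.test/api/test-data/cart');
  });
});
```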

For expedience and conservation of resources, tests should not trap exceptions that result from errors. Exception handling in tests is dodgy; there are only a few justifications for it:

  • Known race conditions that aren’t a result of application flaws — An example of this is an AJAX spinner. Good frameworks now handle this without exceptions, but trapping the exception is acceptable if you aren’t yet on a modern framework. But, seriously, get your team on a modern framework.
  • Setup and Teardown stages — You will often need to trap exceptions to determine the state of the data.
  • Easier debugging when there are multiple exceptions for different reasons — In this case, you can add a message and re-throw the exception (see the sketch after this list).
  • Trapping an expected error to assert against it — This is rarely a valid thing to do for anything that’s not a unit or integration test.
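
As a sketch of the second and third bullets, assuming a TypeScript test project and a hypothetical API client (the interface and endpoint names are placeholders): a setup helper traps the exception only long enough to add context, then re-throws so the test still fails fast.

```typescript
// Hypothetical API client used to seed test data.
interface ApiClient {
  seedTestUser(): Promise<{ id: string }>;
}

// Setup-stage helper: trapping the exception here is justified because we
// need to know whether the test data was created at all.
async function setUpTestUser(api: ApiClient): Promise<{ id: string }> {
  try {
    return await api.seedTestUser();
  } catch (err) {
    // Add context so the failure report says which stage broke, then
    // re-throw so the test fails fast instead of limping on with bad state.
    throw new Error(`Setup failed while seeding the test user: ${String(err)}`, {
      cause: err,
    });
  }
}

export { setUpTestUser };
```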

Automated tests should fail or error fast. They should not take longer to fail or error than they take to pass. They should validate only one state transformation or component interaction at a time; this means they shouldn’t have multiple asserts, nor should they be able to fail for multiple reasons. A given test should only ever fail for a single reason: it found a defect in the feature.

While retries are a good way to keep an application resilient, they are often a smell in test automation and can indicate lingering application or environment problems. Some modicum of retrying is expected as a result of state polling, but many popular modern testing frameworks have that resilience built in.
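
When the framework supports polling assertions, that modicum of retrying can live in the assertion itself rather than in hand-written loops. A minimal sketch, assuming Playwright; the URL, test id, and expected text are placeholders:

```typescript
import { test, expect } from '@playwright/test';

test('batch job eventually marks the order processed', async ({ page }) => {
  await page.goto('https://example.test/orders/42');

  // Web-first assertion: the framework re-checks the locator until it
  // matches or the timeout expires; there is no explicit retry code here.
  await expect(page.getByTestId('order-status')).toHaveText('Processed', {
    timeout: 10_000,
  });
});
```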

However, the fact that your tests are retrying steps can be an indicator of problems in the application or the environment. Maybe the local environment is sporadically slow, due either to the SUT or one of its dependencies. If an automator finds themselves doing wholesale test retries, that can point to pipeline, infrastructure, or even test data management issues. Or there may be an actual problem in the application that isn’t getting fixed because the development team doesn’t consider the automated test user a full-fledged user.

My point here is that by building retry resilience into test harnesses, test automators risk masking underlying problems that would be solved sooner if they were surfaced to the team.

The same holds true for timeouts in test projects. While servers and batch processes use timeouts to increase resiliency, timeouts in test automation can be an indicator of application and environmental problems, temporary or otherwise. Furthermore, timeouts are particularly pernicious because once an automator increases a timeout, they rarely go back to find out whether it is still needed. Instead, it stays, adding time to every run of the test it is in.
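
Most frameworks at least keep these knobs in one place, which makes them easy to audit. Here is a sketch of where they live in a Playwright config; the values are placeholders, not recommendations:

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 30_000,             // per-test timeout
  expect: { timeout: 5_000 },  // per-assertion polling timeout
  // If you bump one of these to paper over a slow dependency, leave a
  // comment with the ticket and a date so someone revisits it later.
});
```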

Use failures to your advantage

It seems basic to have to state this, so forgive me if I’m being too obvious. Rather than trap errors, some of your tests, or even your test setup, should potentially act as gates to the execution of others. For example, if all of your tests need to log in to your site and login is broken, none of your tests will pass. It makes sense not to even try to run them because, unless your company only has one application, every run of your tests consumes some amount of shared resources.

Most modern test frameworks have a built-in skip mechanism that will do exactly this: prevent a test from executing if some other test or setup step fails. Good automators not only let their test failures bubble up to expose the underlying problem, but also predict when certain failures in a given test foreshadow failures in other tests in the project.
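
As a sketch of one such mechanism, assuming Playwright: in serial mode, the remaining tests in a group are skipped as soon as one of them fails. The URLs, selectors, and credentials below are placeholders.

```typescript
import { test, expect } from '@playwright/test';

// In serial mode, a failure in an earlier test skips the rest of the group
// instead of letting every test fail slowly for the same underlying reason.
test.describe.configure({ mode: 'serial' });

test('login works', async ({ page }) => {
  await page.goto('https://example.test/login');
  await page.getByLabel('Email').fill('qa@example.test');
  await page.getByLabel('Password').fill('secret');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});

test('order history is visible', async ({ page }) => {
  // Automatically skipped if 'login works' failed above, so a broken login
  // costs one failure instead of a wall of them.
  await page.goto('https://example.test/orders');
  // ... the rest of the flow, which would log in again or reuse stored state.
  await expect(page.getByRole('table')).toBeVisible();
});
```

Playwright’s project dependencies offer a heavier-weight version of the same gate: if a setup project such as a login smoke test fails, every project that depends on it is skipped.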

If your tests have automatic wholesale retries of failed tests, you can use that information to gather metrics and, in sophisticated systems, even act on it. For example, if you retry a given test three times and it fails, then passes, then fails: congratulations, you have a flaky test. You can use this information to automatically disable the flake and notify your team.
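
As one sketch of acting on that information, assuming Playwright’s reporter API: a test that failed and then passed on retry reports the outcome "flaky", which a custom reporter can collect and surface. The notification step is a placeholder for whatever your team uses (Slack, a ticket, a dashboard).

```typescript
import type { Reporter, TestCase, TestResult } from '@playwright/test/reporter';

class FlakeReporter implements Reporter {
  private flaky: string[] = [];

  onTestEnd(test: TestCase, result: TestResult) {
    // 'flaky' means the test failed at least once and then passed on retry.
    if (test.outcome() === 'flaky') {
      this.flaky.push(test.title);
    }
  }

  async onEnd() {
    if (this.flaky.length > 0) {
      console.warn(`Flaky tests detected: ${this.flaky.join(', ')}`);
      // Placeholder: notify the team or disable the test via your own tooling.
    }
  }
}

export default FlakeReporter;
```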

But any time a test that was previously failing passes, you need to investigate it unless you are sure you know what caused that failure. And if you don’t, you are best advised to disable that test.

