Test Flakiness in Automation

One prevalent critique of UI tests is their perceived “flakiness.” What does this term imply, and what strategies can be employed to overcome this challenge?

Lana Begunova
8 min read · Jan 21, 2024

If you’re well-versed in the realm of test automation, you’ve likely heard UI tests characterized as unstable. The term commonly used for this is “flaky.” What exactly does flakiness entail, where does it come from, how should we interpret it, and what can be done to address it?

What is Flakiness?

Flakiness is when a test passes one time, then fails another.

When someone refers to a test as “flaky,” they are expressing that the test may pass during one execution and fail during another, even if neither the application code nor the test code has been altered. This can be particularly frustrating — investing time in crafting a test that initially succeeds, only to encounter intermittent failures, be it 1%, 10%, or even 50% of the time when run in a CI (Continuous Integration) system.

“Flaky” or “nondeterministic” means we don’t know what the problem is.

An alternative term for describing this situation is “nondeterministic.” In essence, it means that when we execute our test, there’s uncertainty about whether it will definitely pass or fail, even when the application code and test code remain unchanged.

Let’s pause for a moment. Is it genuinely nondeterministic? After all, we are working with computers, and aren’t all computer actions precisely determined? To a large extent, yes. In fact, when people discuss flakiness or nondeterminism, what I’ve observed is that most often, they simply don’t understand why something seems unstable. There’s no apparent reason to them.

Behind an erroneous test failure there is always a specific reason, even if we don’t understand it yet.

Caution is warranted in this context. The fact that we can’t readily identify a clear reason for a test to pass and fail intermittently doesn’t imply there isn’t one. I’ve witnessed many testers and developers hastily label an Appium or Selenium test as “flaky” without actually investigating the root cause. We’ll revisit this aspect shortly. For now, let’s establish a revised definition for flakiness: a flaky test is one that experiences sporadic passes and failures for reasons that are not yet understood.

Dangers of Flakiness

So, is flakiness a significant concern? What if a test fails only 1 time in 100? Is it worth investing time to understand why it fails that 1 time?

Flakiness is a rather significant issue. We don’t want our build to be full of ‘tests that cry failure’.

The answer is typically yes. Allowing any instability in our build is generally ill-advised. Many of you may be familiar with the story of the ‘boy who cried wolf.’ In our industry’s context, we could refer to the ‘test that cries failure’.

Tests that fail erroneously lead to a loss of quality signal and a loss of trust in our test suite.

One issue with flakiness is that a flaky test starts to lose its effectiveness in signaling anything about the quality of the feature it covers. When a test has previously failed erroneously, there’s a strong temptation to view any future failures as equally erroneous. In such cases, it might be more beneficial to not have the test at all.

Small, seemingly insignificant levels of test instability are magnified under normal CI conditions.

The other challenge is that even a slight degree of instability in a test gets magnified when considering the broader context of a build that executes numerous tests across various platforms.

Flakiness Math

Let’s run through a bit of calculation. Assume that we have a test suite with 100 different test cases, and each test has a stability rating of 99.9%, indicating it fails only 1 out of every 1,000 times it’s run — seemingly quite stable. Now, picture our build running those test cases on 2 Android versions and 2 iOS versions, each in both a phone and a tablet form factor. That makes 8 distinct devices, so each test case is executed 8 times and the build runs 800 tests in total.

With each test having a 0.1% chance of failing spuriously and 800 test executions per build, we can expect about 0.8 false failures per build on average, and the probability that at least one test fails erroneously is 1 − 0.999^800, or roughly 55%. Consequently, more than half of our builds will necessitate an investigation into a failure that ultimately proves to be a false alarm. Considering the impact on trust, how much confidence do you think the development team will have in a build that fails every other run even when nothing is wrong? This illustrates how even minor levels of instability become significant when running tests at scale.
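
If you want to plug in your own numbers, here’s a small Python sketch of that arithmetic (the suite size, device count, and failure rate below simply mirror the example above):

```python
# Back-of-the-envelope math for flaky builds.
# These numbers mirror the example above; substitute your own.

tests_per_suite = 100   # distinct test cases
devices = 8             # 4 OS versions x 2 form factors
failure_rate = 0.001    # each test fails spuriously 0.1% of the time

total_runs = tests_per_suite * devices                  # 800 executions per build
expected_false_failures = total_runs * failure_rate     # average spurious failures
p_at_least_one = 1 - (1 - failure_rate) ** total_runs   # chance a build is "red" for no reason

print(f"Test executions per build: {total_runs}")
print(f"Expected false failures per build: {expected_false_failures:.2f}")
print(f"Chance of at least one false failure: {p_at_least_one:.0%}")
# -> roughly 55% of builds contain at least one spurious failure
```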

Causes of Flakiness

We’ve discussed the risks associated with flakiness, but what about its root causes? As we delve into the unknown and invest time in uncovering the underlying reasons, we discover a range of potential solutions.

Race Conditions

One of the primary causes of instability that turns out to be fixable from the test code’s perspective is the race condition. A race condition occurs when our test assumes that the app is in a certain state, but the app has not yet reached that state (or has reached it and already moved on). Using explicit waits to check for the specific app condition our test depends on is usually a good strategy, once we’ve pinned down the exact nature of the race condition.
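
As a rough illustration, here is a minimal Python sketch of that idea using Selenium-style explicit waits; the “save-button” and “success-banner” locators are hypothetical placeholders, not from any particular app:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def save_and_verify(driver):
    """Click a (hypothetical) Save button and wait for its success banner."""
    # Fragile alternative: time.sleep(2) and hope the app has caught up.
    # Better: wait explicitly for the state the test actually depends on.
    driver.find_element(By.ID, "save-button").click()  # hypothetical locator
    banner = WebDriverWait(driver, timeout=10).until(
        EC.visibility_of_element_located((By.ID, "success-banner"))  # hypothetical locator
    )
    assert "Saved" in banner.text
```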

Bad Test Assumptions

More generally, there is the category of bad test assumptions: cases where our test code assumes something that simply isn’t true. Race conditions are just the most common example. There are others, such as assuming a certain screen size. This matters especially on Android, where elements that are not currently on screen don’t exist in the UI hierarchy at all. Our test might reliably find a certain element over and over again, and then, when run on a different device type, fail to find the element at all. Running the test on a variety of platforms and devices before committing it to our build is a good way to flush out these kinds of assumptions. This is a practice we call “test hardening”: letting the test become more robust by adjusting it in the face of lots of different environments.
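
For instance, one common hardening step on Android is to scroll an element into view before interacting with it, since an off-screen element simply isn’t in the hierarchy. A minimal Appium (Python client) sketch, with a hypothetical “Terms of Service” locator, might look like this:

```python
from appium.webdriver.common.appiumby import AppiumBy

def tap_terms_link(driver):
    """Scroll a (hypothetical) 'Terms of Service' entry into view, then tap it."""
    # On Android, an element that is off screen is absent from the UI hierarchy,
    # so scroll it into view before trying to find it.
    element = driver.find_element(
        AppiumBy.ANDROID_UIAUTOMATOR,
        'new UiScrollable(new UiSelector().scrollable(true))'
        '.scrollIntoView(new UiSelector().textContains("Terms of Service"))',
    )
    element.click()
```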

External Instability

Another cause of test instability is instability in environments that are external to the test. One of the benefits of UI tests is that they take place in totally real-world environments. But one of the drawbacks is that the real-world environment is often not as clean and predictable as we would like.

For example, many mobile and web apps rely on various web services to function correctly. But what if there are issues with these backend services that are unrelated to the UI we’re trying to test? These issues could show up as all kinds of errors. If the web service is under our team’s control, that would be a good type of bug to catch. But apps also rely on many third-party services, whether it’s a CDN delivering assets or some other third-party API. What if one of those happens to be down, or decides to block our requests because we have 100 tests hitting the API at the same time? There are all kinds of cases like this that leave us confused.

The ideal solution to this kind of environmental instability is to find a way not to rely on an external environment at all. Basically, if we can find a way to create our own fake/mock test environment that our app runs completely within during the course of the test, then we can address this kind of instability.
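
As one possible approach, here is a minimal Python sketch of a throwaway fake backend built on the standard library’s http.server. The endpoint, port, and payload are illustrative; in practice you would point the app under test (or a test build of it) at this local address instead of the real service:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeApiHandler(BaseHTTPRequestHandler):
    """A tiny fake backend that always returns a canned response, so the test
    no longer depends on a real (possibly flaky) web service."""

    def do_GET(self):
        body = json.dumps({"status": "ok", "items": []}).encode()  # hypothetical payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet

# Start the fake service in the background for the duration of the test run.
server = HTTPServer(("127.0.0.1", 8085), FakeApiHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Point the app at http://127.0.0.1:8085 instead of the real backend,
# run the UI tests, then call server.shutdown() during teardown.
```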

App Flakiness

The last cause of test flakiness that we’ll look at is a bit subtle, and we could call it “app flakiness.” The purpose of a test is to find bugs in an application. But what happens if an app has bugs that don’t show up reliably? Apps are so complex nowadays that it’s not uncommon to hear of a bug that appears semi-randomly. We might have a rock-solid test that fails periodically for no other reason than that the app didn’t do the right thing, even though the app never exhibits that behavior in manual testing. If you find yourself in this situation, you’ve done a great service for your team! In a way, running many tests against an app is a form not just of functional testing but of stress testing, and we can often flush out rare bugs just by running a lot of tests. The solution to this kind of problem is, of course, to fix the app.
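
If you suspect this kind of intermittent app bug, one simple tactic is to rerun the same scenario many times and record how often it reproduces. A rough Python sketch, where run_checkout_test is a hypothetical stand-in for whatever function drives your Appium or Selenium scenario:

```python
def measure_stability(test_fn, runs=50):
    """Repeat one test many times and report how often it fails."""
    failures = []
    for i in range(runs):
        try:
            test_fn()
        except Exception as exc:  # record the failure and keep going
            failures.append((i, repr(exc)))
    print(f"{runs - len(failures)}/{runs} passes")
    for run_index, error in failures:
        print(f"  run {run_index}: {error}")
    return failures

# failures = measure_stability(run_checkout_test, runs=50)  # hypothetical test function
```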

General Recommendations

Take a zero-tolerance approach to flakiness as far as possible.

In essence, my recommendation is to adopt a zero-tolerance approach to test flakiness. This doesn’t imply having a flawless test suite, free from occasional erroneous failures. Rather, it emphasizes the importance of not accepting unexplained failures. Given the high fidelity of UI testing, there will inevitably be environmental failures beyond our control. However, it is crucial to accurately discern failures that are within our control. The only way to ensure this is by understanding the root cause of the failure. As testers, it is our responsibility to make strides in that direction, even if it means collaborating with app developers to gain a deeper understanding.

Use best practices from the beginning.

Leveraging best practices is crucial when writing Appium and Selenium tests. These practices help address common causes of instability right from the start.

Put processes in place to protect the build from slow contamination by flakiness.

For any remaining challenges, implementing certain measures can provide insights into the stability of our tests. For instance, consider establishing a system where newly added tests undergo an initial phase of quarantine or purgatory. During this phase, these tests are executed multiple times and must meet a specified success threshold before being permitted to contribute to a build. This approach can effectively prevent flaky tests from contaminating your build in the first place. I highly recommend incorporating such a practice.
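
As a sketch of what such a quarantine gate might look like, assuming your tests can be run individually from the command line (the pytest invocation, run count, and 95% threshold below are illustrative choices, not a prescription):

```python
import subprocess

def quarantine_gate(test_id: str, runs: int = 20, threshold: float = 0.95) -> bool:
    """A new test must pass at least `threshold` of `runs` executions
    before it is allowed into the main build."""
    passes = 0
    for _ in range(runs):
        result = subprocess.run(["pytest", test_id, "-q"])
        if result.returncode == 0:
            passes += 1
    pass_rate = passes / runs
    print(f"{test_id}: {passes}/{runs} passes ({pass_rate:.0%})")
    return pass_rate >= threshold

# if quarantine_gate("tests/test_login.py::test_happy_path"):  # hypothetical test id
#     ...  # promote the test into the main suite in your CI pipeline
```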

Alright, we’ll conclude our discussion on test instability for now. Stay tuned for additional practices that can be implemented to ensure speedy and dependable tests overall!

Happy testing and debugging!

I welcome any comments and contributions to the subject. Connect with me on LinkedIn, X, GitHub, or IG.
