Test Flakiness in Automation — Continued

Lana Begunova
9 min read · Jan 31, 2024

If you’ve been in QA or functional automation for some time, you have most definitely heard the term ‘flaky test’. There’s a lot that gets wrapped up in this, so any time we have a conversation about the reliability and repeatability of tests, it’s important to have a good understanding of what flakiness means and how we should think about it.

Range of Flake Principles

In a prior post, we discussed the definition of flakiness, its causes, its dangers, and so on. Now let’s talk about flakiness with reference to the two ends of the spectrum of attitudes towards it.

Retry a test n-number of times if it fails. If it eventually passes, that’s fantastic! Move on.

On one end, we can take the attitude that if a test fails, we should just rerun it until it passes. If it ultimately does pass on a rerun, great! We call it a flaky test, but it has still passed. That’s good enough for us, so let’s add it to the build and move on. In other words, we simply adopt the language of flakiness into our vocabulary and allow flaky tests into our builds. We might set a threshold for how many times a test is allowed to fail before we quarantine it, but we essentially live with flakiness as a fact of life.

Are we adding a new test to the build? Run it 100 times. If it fails even once, the test is rejected.

At the other end of the spectrum is an approach that tries to disallow any flakiness at all. One typical scenario I’ve come across involves companies running Selenium or Appium tests numerous times before merging them into the main/master branch and incorporating them into subsequent builds. This process establishes a flakiness profile for these tests. If any flakiness is detected, in certain instances the tests may be rejected and returned to the author with an explanation that they cannot be merged into the build. This may also affect the inclusion of the associated feature.
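As a rough sketch of what such a gate might look like (this is my own illustration, not a process described in the original post; the pytest invocation, test path, and run count are assumptions), a small script can run a single new test repeatedly and reject it on the first failure:

```python
# flake_gate.py -- hypothetical sketch of a "run it N times before merging" gate.
# Assumes pytest is installed and the new test is addressable by its node ID.
import subprocess
import sys

RUNS = 100  # how many consecutive green runs we require before merging


def gate(test_node_id: str, runs: int = RUNS) -> bool:
    """Run a single test repeatedly; reject it if it fails even once."""
    for attempt in range(1, runs + 1):
        result = subprocess.run(["pytest", "-q", test_node_id])
        if result.returncode != 0:
            print(f"Run {attempt}: FAILED -- rejecting the test, back to the author.")
            return False
    print(f"All {runs} runs passed -- the test may be merged.")
    return True


if __name__ == "__main__":
    # e.g. python flake_gate.py tests/test_login.py::test_happy_path
    sys.exit(0 if gate(sys.argv[1]) else 1)
```

In practice a gate like this usually lives in CI rather than on a laptop, and the run count is a policy decision rather than a magic number.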

These represent two extremes in approaches or attitudes toward flakiness. In the first scenario, where flakiness is deeply integrated into the team’s vocabulary, one technical approach to addressing this reality is by prioritizing flakiness as a primary concern, essentially treating it as a first-class citizen within our test framework.

Flakiness as a Primary Concern

Consider some example output of a test run: dots mean passing tests, F’s mean failures, and E’s mean that something blew up or crashed. What we can do here is add a new category of test result and make the flake a first-class category in test reporting: a K symbol, which stands for a flake.

This approach can prove highly beneficial. How to implement it varies depending on your test runner. Generally, we delve into the internals of the test runner and either monkey patch or hook into its failure routines. Instead of immediately flagging a failure on the initial attempt, we retry or replay the test a set number of times; only if it fails every time do we acknowledge a failure. If it eventually passes, we categorize it as a flake.
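To make this concrete, here is a minimal, runner-agnostic sketch, assuming each test is a plain Python callable; the function name, symbols, and retry budget are my own. A real implementation would hook the same logic into your test runner (for example, as a pytest plugin) rather than wrap callables by hand.

```python
# Hypothetical sketch: treat the flake (K) as a first-class test outcome.
# '.' = pass, 'F' = assertion failure, 'E' = error/crash,
# 'K' = flake (failed at least once, then passed within the retry budget).
from typing import Callable


def run_with_flake_detection(test: Callable[[], None], retries: int = 3) -> str:
    outcome = "E"
    for attempt in range(1, retries + 1):
        try:
            test()
            # Passed: it is a flake if any earlier attempt had failed.
            return "." if attempt == 1 else "K"
        except AssertionError:
            outcome = "F"
        except Exception:
            outcome = "E"
    return outcome  # retries exhausted: report the last kind of failure


# Usage sketch:
# results = [run_with_flake_detection(t) for t in discovered_tests]
# print("".join(results))  # e.g. "..K.F..E.."
```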

Make a conscientious effort to monitor statistics related to flaky tests. This way, even if a test doesn’t fail in a specific build, we can refer to a database of statistics in the future and identify it as a problematic test. This information allows us to make informed decisions, such as either removing the test from our build or addressing and fixing the issues associated with it.
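One lightweight way to keep those statistics, sketched here with an assumed SQLite schema and function names of my own, is to record every flaky occurrence as it happens and query the worst offenders later:

```python
# Hypothetical sketch: persist flake occurrences so they can be reviewed later,
# even when the test ultimately passed in a given build.
import sqlite3
from datetime import datetime, timezone


def record_flake(db_path: str, test_name: str, attempts: int, build_id: str) -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS flakes (
                   test_name TEXT,
                   attempts  INTEGER,
                   build_id  TEXT,
                   seen_at   TEXT
               )"""
        )
        conn.execute(
            "INSERT INTO flakes VALUES (?, ?, ?, ?)",
            (test_name, attempts, build_id, datetime.now(timezone.utc).isoformat()),
        )


def flakiest_tests(db_path: str, limit: int = 10):
    """Return the tests that have flaked most often, worst first."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT test_name, COUNT(*) AS flake_count FROM flakes "
            "GROUP BY test_name ORDER BY flake_count DESC LIMIT ?",
            (limit,),
        ).fetchall()
```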

This approach is flexible, acknowledging that flakes are inevitable. We aim not to be excessively rigid but to retain the capability to address them later with informed decisions if necessary.

Tests That Signal False Alarms

Viewed from the other end of the attitudinal flake range, this approach has some problems.

The test flaked = The test passes and fails, and we don’t understand why

One problem lies in our understanding, where we use phrases like ‘the test flaked’, when in reality, the test intermittently passes or fails without a clear reason. We lack comprehension of the underlying nondeterminism in this test, leading us to label it as flaky. Consequently, we resort to meta-level strategies to handle these flakes. But should this truly be our approach?

Ignorance ≠ Bliss

In this case, remaining unaware of the situation is likely unwise.

What if a flaky test is telling us the app is flaky? Do we want our users to ‘retry’?

What if this unpredictability stems from the app itself? In essence, what if this test is indicating that our app is inherently unreliable? Consequently, we’re delivering an unstable app to our users. It’s possible that if users consistently ‘retry’ using our app, it might eventually function. However, is this the type of app we want to ship?

Only tolerate flaky tests when their causes are completely understood and beyond our control. Don’t hesitate to delve a few levels deeper for a thorough explanation.

The main point to emphasize in reconciling various approaches to flakiness is the importance of accepting only the flakiness that is fully understood and beyond our control. Occasionally, when our app depends on internet connectivity, network disruptions may occur, leading to test failures due to unloaded network resources. Such failures aren’t indicative of problems within our tests, Selenium/Appium, or our app itself; rather, they stem from the inherent nature of the internet. Unfortunately, we have limited control over internet reliability. However, it’s worth exploring potential solutions, such as running tests in environments with more stable networking profiles or mocking out external services in a test version of our app to minimize reliance on the internet. These solutions become apparent only when we have a thorough understanding of the underlying problem.
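As a simple illustration of the mocking option, assuming hypothetical class and function names, the app can expose a seam where a test build substitutes a deterministic fake for the real network client:

```python
# Hypothetical sketch: give the app a seam so a test build can swap the real
# network client for a deterministic fake, removing internet flakiness.
class WeatherClient:
    """Real client: talks to the network, and is therefore flaky by nature."""

    def fetch_forecast(self, city: str) -> dict:
        raise NotImplementedError("calls a live HTTP API in production")


class FakeWeatherClient(WeatherClient):
    """Test double: answers instantly and deterministically."""

    def fetch_forecast(self, city: str) -> dict:
        return {"city": city, "temp_c": 3}


def render_forecast(client: WeatherClient, city: str) -> str:
    data = client.fetch_forecast(city)
    return f"{data['city']}: {data['temp_c']}°C"


def test_forecast_rendering_without_network():
    # The flow under test is exercised end to end, but no packet leaves the box.
    assert render_forecast(FakeWeatherClient(), "Oslo") == "Oslo: 3°C"
```

The design point is that the dependency is explicit: the same flow runs against the real client in production and against the fake in tests, so the failures that remain tell us about our code rather than about the internet.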

It’s advisable to tolerate only the flakiness that is fully comprehended and unchangeable. Testers often lack the knowledge, tools, or access to source code to delve deeper into the root causes of nondeterminism. Nevertheless, organizations should empower and encourage exploration of such explanations. Understanding the causes of nondeterminism, whether within our tests or our app, is crucial. By comprehending the underlying issues, we can devise effective strategies to mitigate flakiness and enhance the reliability of our testing processes. Thus, organizations should foster an environment that promotes thorough exploration and understanding of flakiness to facilitate informed decision-making and improvement initiatives.

Flaky Test Steps

There’s a temptation to enclose individual test steps that seem unreliable in a retry block.

I’ve noticed that some testers focus on specific test steps when discussing flakiness. For instance, they might identify a particular element as flaky and suggest enclosing only that specific step within a retry block, rather than the entire test.

Occasionally, this is acceptable — refer to the post on waits.

At times, it’s acceptable. In fact, the WebDriver API incorporates an element of this functionality. It involves waiting for a specific application state, such as the appearance of an element or another condition. I’ve written a comprehensive post on this topic, so feel free to delve into it at your convenience.
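For reference, an explicit wait with Selenium’s Python bindings looks roughly like this; the URL, locator, and timeout below are purely illustrative:

```python
# Illustrative use of a WebDriver explicit wait: poll for an application state
# (here, an element becoming clickable) instead of assuming it is already there.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # illustrative URL

# Wait up to 10 seconds for the submit button to become clickable,
# rather than sleeping a fixed amount or failing immediately.
submit = WebDriverWait(driver, timeout=10).until(
    EC.element_to_be_clickable((By.ID, "submit"))  # the locator is an assumption
)
submit.click()
driver.quit()
```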

Only proceed with this approach if you thoroughly comprehend the root cause of indeterminism and recognize its inevitability.

It’s important to do this only when we completely understand what’s happening. Perhaps something in our app dynamically changes at an interval we can’t predict, because of some external factor our test has no access to, and we simply have to live with it. That might mean retrying a number of times. But only do this in situations where you know what’s going on and have reason to think that retrying is actually helping you, rather than masking some lower-level nondeterminism.
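A narrowly scoped retry around one such step, say an element that goes stale because a widget re-renders on a schedule we cannot observe, might look like the sketch below; the helper name, exception choice, and retry bound are assumptions for illustration:

```python
# Hypothetical sketch: retry ONE understood-to-be-nondeterministic step,
# bounded and scoped, rather than wrapping the whole test in retries.
import time

from selenium.common.exceptions import StaleElementReferenceException


def click_refreshing_widget(driver, locator, attempts: int = 3) -> None:
    """The widget re-renders on an external schedule we cannot control,
    so a stale reference between find and click is occasionally expected."""
    last_error = None
    for _ in range(attempts):
        try:
            driver.find_element(*locator).click()
            return
        except StaleElementReferenceException as exc:
            last_error = exc
            time.sleep(0.5)  # give the re-render a moment to settle
    raise last_error
```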

Otherwise, we’re merely applying a temporary band-aid on instability, which will likely result in repercussions later, either within our app or during our build process.

Basically, we don’t want to put a band-aid over things we don’t understand, because then we’re building instability into our test suite and we’ll pay for it later, either by missing actual problems in our app or by making our build so brittle and prone to failure that we stop trusting it.

Vital Flakiness Points

Functional tests occur within systems that are inherently dynamic.

The functional tests that we write take place in inherently dynamic systems: real apps, real devices, and real networks that change in ways we don’t fully control.

Expecting hermetic compartmentalization and reliability akin to unit tests is a formula for inevitable frustration.

We’re never going to have a completely hermetically sealed environment like we do with unit tests. Expecting that level of stability simply sets functional testers up for frustration.

Change the approach: automate selectively while delivering substantial value. Establish a foundation of highly effective tests, ensure their robustness, and then gradually expand them as we gain insights into the specific aspects of our system.

Reverse the scenario: stop blaming the functional tests for being flaky and slow, and instead ask what kind of value we can get out of them. Aiming for a million functional tests, just like we do with unit tests, is the wrong approach.

The right approach is to ask, ‘What are the most important user flows in our app? How can we write simple, valuable, robust, rock-solid functional tests for just those select few flows, tests that start to deliver value by catching bugs?’ Then expand outward from there.

“A smart incremental approach to writing customer tests that guide development is to start with the “thin slice” that follows a happy path from one end to the other. Identifying a thin slice, also called a “steel thread” or “tracer bullet”, can be done on a theme level, where it’s used to verify the overall architecture. This steel thread connects all of the components together, and after it’s solid, more functionality can be added.”

https://www.amazon.com/Agile-Testing-Practical-Guide-Testers/dp/0321534468

By spending a lot of time on a few tests and understanding where our app or system is likely to be flaky because of the particular setup that we have, we can build up a library of helper methods or utilities that reduce flakiness for us. We can then apply them to new tests, rather than just building lots of tests on a flaky foundation.
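Such a helper library can start very small. As a sketch, with names of my own choosing, two shared utilities that encapsulate the suite’s waiting policy might look like this:

```python
# Hypothetical sketch of a tiny shared helper module: every test goes through
# these utilities, so the team's hard-won timing knowledge lives in one place.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

DEFAULT_TIMEOUT = 10  # tuned once, for the whole suite


def click_when_ready(driver, locator, timeout: int = DEFAULT_TIMEOUT) -> None:
    """Wait until the element is clickable, then click it."""
    WebDriverWait(driver, timeout).until(EC.element_to_be_clickable(locator)).click()


def text_of(driver, locator, timeout: int = DEFAULT_TIMEOUT) -> str:
    """Wait until the element is visible, then return its text."""
    return WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(locator)
    ).text
```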

In this post, we’ve delved further into exploring and addressing test automation challenges such as flakiness. We’ve covered the spectrum of flake principles, flaky test steps, vital flakiness points, and so forth. The topic of creating and maintaining robust, reliable, and repeatable test suites is rather vast. Stay tuned for future posts!

Happy testing and debugging!

I welcome any comments and contributions to the subject. Connect with me on LinkedIn, X, GitHub, or IG.


Lana Begunova

I am a QA Automation Engineer passionate about discovering new technologies and learning from them. The processes that connect people and tech spark my curiosity.