Test Flakiness: Improper Use of Timeouts

Common anti-patterns and how to address them

Venkatesh-Prasad Ranganath
Apr 7, 2023

For some time now, I have been closely helping folks fix flaky Android tests. Along the way, I have learned quite a bit about how flaky tests affect dev workflows, how they increase costs, how to mitigate them, and how to eliminate them. This post is an attempt to share some of what I have learned.

Flaky tests

Flaky tests yield different verdicts when repeatedly executed in “identical” environments.

Here are a few examples of flaky tests.

  1. A test that does not use network connectivity passes when executed in an environment with network connectivity but fails when executed in an environment without network connectivity.
  2. A test that passes when executed under normal system load but fails when executed under higher system load.
  3. A test that depends on interface P passes when executed with library X but fails with library Y when both libraries expose the same interface P.
  4. A test that passes when executed on an emulator but fails on a physical device.

The above examples hinge on the question “what does an environment encompass?” and, consequently, on whether two environments are identical.

In the rest of this post, I will focus on a specific case of the third example that involves functional tests and timeout limits.

While the following observations arose in the context of testing Android apps, I believe they apply to distributed systems as well.

Timeouts

Many tasks in environments such as Android involve timing constraints, e.g., a content provider initialization should complete in 10s. Similar constraints are typical in the presence of communication between independent entities such as threads and processes, e.g., use of opFuture.get(5, SECONDS) to wait for 5s for the result from an asynchronous computation or use of latch.await(10, SECONDS) to wait for 10s for a notification from a concurrent computation.
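
For concreteness, here is what those two idioms look like in Java; startOp, startOpThatSignals, and Result are hypothetical placeholders for the code under test, not APIs discussed in this post.

// Wait at most 5s for the result of an asynchronous computation.
Future<Result> opFuture = startOp();
Result result = opFuture.get(5, SECONDS); // throws TimeoutException if 5s elapse first

// Wait at most 10s for a notification from a concurrent computation.
CountDownLatch latch = new CountDownLatch(1);
startOpThatSignals(latch); // hypothetical: counts down the latch when it finishes
boolean signalled = latch.await(10, SECONDS); // returns false if 10s elapse first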

Tests involving such communication often rely on timeouts to wait for tasks to complete in a definite period.

The primary reason to use such timeouts is to ensure the tests do not execute forever due to runaway computations. (A secondary reason is to check for the absence of an event, and I’ll get to this later in this post.)

From timeouts to flaky tests

Suppose a test starts an asynchronous operation op and waits for its result via a future opFuture. Often, to prevent the test from waiting forever, we wait on opFuture for 10s, e.g., opFuture.get(10, SECONDS).
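
As a sketch, such a test might look like the following; startOp, Value, and EXPECTED are illustrative placeholders.

@Test
public void opYieldsExpectedValue() throws Exception {
  Future<Value> opFuture = startOp(); // kicks off op asynchronously
  Value actual = opFuture.get(10, SECONDS); // assumes op completes within 10s
  assertEquals(EXPECTED, actual);
}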

This test can be flaky in two possible ways.

  1. Suppose op guarantees to complete execution in a finite amount of time; say, op will complete in less than 10s under average system load. The test can fail when the load on the test environment (e.g., Android virtual device) is much higher than the average.
  2. Suppose op does not provide guarantees about its execution time. The test can intermittently fail because it rests on an unfounded assumption: that op will complete, and opFuture will yield a value, within 10s.
    In this case, when the test fails because op timed out, we cannot be sure whether it failed for a valid reason or merely because op was not given enough time.

Can we eliminate such flakes?

Yes. We can eliminate the first kind of flakes by ensuring the test environment honors the needs/expectations of the APIs used in the unit under test (UUT). Consequently, we need to be more diligent about the needs and guarantees of APIs and the guarantees of the test environment.

To avoid the second kind of flakes, assuming we cannot establish guarantees/contracts for APIs, we can use more permissive timeouts, e.g., opFuture.get(30, SECONDS). However, this is not a good solution because it can break as the unspecified timing behavior of the APIs changes.

Instead, again assuming we cannot establish guarantees/contracts for APIs, indefinite waiting (i.e., without timeout; e.g., opFuture.get()) is a better solution as it does not make any assumptions about the execution time of the operation op.
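
With that change, the earlier sketch simply drops the timeout (same placeholder names as before):

Value actual = opFuture.get(); // waits as long as op needs; no assumption about its duration
assertEquals(EXPECTED, actual);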

However, using indefinite waiting has two downsides.

  1. Indefinite waiting can lead to prolonged executions of broken tests. We can avoid such executions by establishing a test environment-wide limit on the time allowed for any test execution (not for individual operations/steps in a test), e.g., five minutes for small tests and 10 minutes for large tests; see the sketch after this list.
  2. With time-limited test executions, test failures stemming from indefinite waiting will be disguised as test infra failures, i.e., assertion failures in tests are cast into test infra timeouts.
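
As one way to impose such a per-test limit, JUnit 4 offers a class-wide Timeout rule; the five-minute figure below mirrors the small-test limit mentioned above.

import org.junit.Rule;
import org.junit.rules.Timeout;

public class SmallFunctionalTest {
  // Abort and fail any test in this class that runs longer than five minutes,
  // independently of the (indefinite) waits inside individual tests.
  @Rule public Timeout globalTimeout = Timeout.seconds(300);

  // ... tests that use indefinite waiting ...
}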

In case of single-threaded concurrency

In single-threaded concurrency, multiple logical threads (e.g., coroutines or tasks on an event loop) execute concurrently on the same platform thread.

In such a setting, when a logical thread has to wait on another logical thread, indefinite blocking (e.g., opFuture.get()) leads to a deadlock: the wait occupies the only platform thread, so the logical thread that would complete the operation never gets to run.

To avoid such deadlocks, we can realize indefinite waiting by combining finite waiting and infinite retries while yielding control.

while (true) {
  try { opFuture.get(10, MILLISECONDS); break; } // finite wait for op to complete
  catch (TimeoutException e) { eventLooper.loop(); } // yield control to the event loop, then retry
}

Don’t disguise performance tests as functional tests

The above observations are specific to functional tests. They do not apply to performance tests.

That said, a common anti-pattern is to disguise a performance test as a functional test, i.e., to conduct performance testing in a test environment that does not provide strong guarantees about performance-related characteristics such as CPU cycles, memory, and load. For example, running performance tests on an Android virtual device without ensuring the virtual device exhibits the performance characteristics of the target execution environment, say, a physical device.

We can address this mistake in two ways.

  1. Conduct performance testing in a representative execution environment, e.g., use an actual device instead of a virtual device.
  2. Confirm whether the test needs to be a performance test. If not, convert it into a functional/behavioral test by changing timing constraints on operations into completion or order constraints on operations, as sketched below. For example, if operation op2 depends on the completion of operation op1, then wait for the completion of op1 before checking on the status of op2.
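
For instance, instead of sleeping for a fixed period and then checking on op2, the test can wait for op1 to complete and only then drive and check op2; op1Future, startOp2, and EXPECTED are placeholders.

// Before: Thread.sleep(10_000); then check op2 -- a timing constraint.
// After: a completion/order constraint.
op1Future.get(); // wait (indefinitely) for op1 to complete
Value op2Result = startOp2().get(); // op2 starts only after op1 has completed
assertEquals(EXPECTED, op2Result);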

Don’t test for absence of events

A typical pattern to test for the absence of an event is to wait a finite amount of time for the event to occur and then conclude the event’s absence if the event does not happen in that period. This pattern works if the test environment represents the target execution environment and the operations leading to the event provide timing guarantees.

When the above constraints are not satisfied, an alternative pattern is to identify and test for the occurrence of “proxy” events that signify the event of interest will not occur. In such instances, rely on indefinite polling or notification instead of timed waits for events.
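
For example, to check that an error callback never fires, a test can wait indefinitely for a proxy event signalling that the operation reached its terminal state and only then assert that the error was never observed; Callback, op, and the method names are illustrative.

CountDownLatch completed = new CountDownLatch(1);
AtomicBoolean errorSeen = new AtomicBoolean(false);
op.run(new Callback() {
  @Override public void onError(Throwable t) { errorSeen.set(true); }
  @Override public void onComplete() { completed.countDown(); } // proxy event: op reached its terminal state
});
completed.await(); // indefinite wait for the proxy event
assertFalse("unexpected error", errorSeen.get()); // the event of interest never occurred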

Summary

Improper use of timeouts due to unfulfilled expectations or a non-representative test environment can lead to flaky tests. We can eliminate such flakes by

  1. establishing guarantees,
  2. using indefinite waiting,
  3. combining infinite retries and finite waiting,
  4. switching from timing constraints to order constraints, or
  5. using a representative test environment and establishing contracts for APIs.

Since the above fixes are general, they can be used in the context of testing mobile apps, distributed systems, and UI interactions as well.
