Deciding What to Test

Testing proficiency extends beyond the ability to create tests. It is equally crucial to discern what aspects should and shouldn’t undergo testing.

Lana Begunova
Jan 26, 2024

The ability to understand the app, break down vague testing requirements, and imaginatively explore edge cases are all part of being a tester. All of this leads to a broad question: how do we determine what to test, when to test it, and on which platforms?

Combinatorial Problem

A combinatorial problem involves counting or generating all possible combinations of a set of objects or elements based on certain constraints. In the context of software testing, combinatorial problems often arise when deciding what test cases to execute to ensure thorough coverage of different input combinations.

In software testing, applications often have numerous input parameters, and testing all possible combinations of these parameters can be impractical or impossible due to the exponential growth of combinations.
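
To make that growth concrete, here is a small Python sketch. The parameters and their values are invented for illustration; the counting argument is the point:

```python
# Illustrative only: a handful of hypothetical input parameters for a checkout
# screen, each with a few possible values.
from itertools import product

parameters = {
    "payment_method": ["card", "paypal", "gift_card"],
    "shipping_speed": ["standard", "express"],
    "coupon": ["none", "valid", "expired"],
    "locale": ["en_US", "de_DE", "ja_JP"],
    "platform": ["ios", "android"],
}

# Exhaustive testing means one test per element of the Cartesian product.
all_combinations = list(product(*parameters.values()))
print(len(all_combinations))  # 3 * 2 * 3 * 3 * 2 = 108

# Every additional three-valued parameter triples the count again; this
# exponential growth is why "just test everything" quickly becomes impractical.
```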

Why not just test everything, on all platforms, as often as possible?

Initially, the answers to these questions might appear straightforward. Why not test everything consistently? Why not outline all conceivable scenarios, code them sequentially, and run them continuously? In theory, this approach would be ideal. Testing aims to assess the app’s quality, and theoretically, more successful test cases would indicate higher app quality. Theoretically, we would prefer receiving this information each time a developer proposes any code changes to the app. Additionally, in an ideal scenario, we would like this validation to occur on every possible device or browser. This, indeed, would be the optimal situation.

Number of Tests — Practical Math

But now let’s do some math and figure out how long this might actually take us. Let’s assume a reasonably simple app for the moment, and arbitrarily stipulate that it has 250 test cases. In my experience this isn’t a huge number, when you factor in a variety of edge cases on top of standard user behavior. Now let’s assume that there are 4 app developers and they each propose 2 changes per day. And now let’s assume that we’re dealing with a mobile app, and we’d like to run it on just the latest two versions each of iOS and Android, with 2 phone and 2 tablet form factors for each of those versions. All of those totals would then be: 250 test cases, across 8 changes per day, across 16 different devices. Multiplying those numbers together we get a value of 32,000 which is the total number of tests we’d need to execute to satisfy this demand. Wow.

If we further assume that each test takes only 30 seconds, which is very much on the low side for most Appium tests but achievable as long as all the shortcuts we’ll learn about later are followed, then running the entire test suite, in all the environments we need, for every change we want to validate, would take 16,000 minutes per day. Unfortunately, that works out to over 10 times the number of minutes actually in a day.
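
Spelled out as a quick back-of-the-envelope calculation (just reproducing the numbers above):

```python
# Reproducing the scenario above.
test_cases = 250
changes_per_day = 4 * 2            # 4 developers, 2 proposed changes each
devices = 2 * 2 * (2 + 2)          # 2 platforms x 2 OS versions x (2 phones + 2 tablets)

executions_per_day = test_cases * changes_per_day * devices   # 32,000
minutes_per_day = executions_per_day * 30 / 60                # at 30 seconds per test

print(executions_per_day)            # 32000
print(minutes_per_day)               # 16000.0
print(minutes_per_day / (24 * 60))   # ~11.1 days of serial runtime per calendar day
```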

Run multiple tests at once. Make contextually relevant judgment calls about which test cases to run or not.

So, how can we solve this problem? There are a number of ways to work around this basic fact that achieving high coverage takes a lot of time. One of the best ways is to run many tests at a time, rather than only one at a time. But in most cases, we actually make judgment calls about what to test and what not to test, as well as how frequently to run certain tests. The situation is also made somewhat better when we’re starting out building a test suite, because when we begin, we have 0 test cases, and it takes 0 seconds to run them. So it’s up to us how to start. Let’s talk about this first.

Happy Revenue Path

When beginning with an empty test suite, which test cases should take priority? Let’s consider this in terms of two dimensions.

Business Importance

One dimension is the significance of a specific user scenario for us as the developers and testers of the app.

  • On this scale, one end represents the user experience that generates revenue for our business. If we’re testing an e-commerce app, then the user flow we probably care most about is the checkout flow. If our users can’t successfully pay us money, our app is entirely pointless to us from a business perspective.
  • The other end of the scale would be things which are less important to our business, like perhaps the ability for users to set their favorite color as a background for the home screen.

UX Frequency

The second dimension concerns the frequency with which a particular variant of the scenario is encountered by users.

  • At one end of this spectrum lies the typical path users follow through a feature, which is regularly experienced by users.
  • At the opposite end are highly obscure edge cases that users are unlikely to encounter.

We can envision these two dimensions creating a spectrum, with edge cases for less significant features at the bottom left and common cases for highly important features at the top right. My recommendation would be to begin at the top right and gradually progress downward.
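
One informal way to put this into practice is to give each candidate test case a rough score on both dimensions and sort by the product. This is only a sketch with made-up cases and scores, not a formal prioritization method:

```python
# Hypothetical candidate test cases, scored 1-5 on each dimension.
candidates = [
    {"name": "checkout with valid card",       "importance": 5, "frequency": 5},
    {"name": "checkout with expired card",     "importance": 5, "frequency": 2},
    {"name": "set favorite background color",  "importance": 1, "frequency": 3},
    {"name": "100-character profile nickname", "importance": 1, "frequency": 1},
]

# Highest importance x frequency first: the happy revenue path floats to the top.
for case in sorted(candidates, key=lambda c: c["importance"] * c["frequency"], reverse=True):
    print(f'{case["name"]}: {case["importance"] * case["frequency"]}')
```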

The pathway users traverse to achieve a successful sale, conversion, or action is occasionally referred to as the revenue path of the app. This is because it delineates the route users follow that ultimately results in our company generating revenue (and naturally, you can substitute revenue with any metric your app seeks to maximize).

And for any given feature, the path through that feature with all standard, positive inputs leading to the normal type of conclusion is called the happy path, since it is the path through that feature that you explicitly design for.

For any feature there will be many other paths that could be taken that might result in error messages and so forth, which are entirely valid paths, but not particularly the ones the feature is purpose-built to enable. For example, if we go to check out at an online store, and we enter all correct shipping and billing information, and our items are in stock, and we have an account, etc., then we have gone down the happy path for the checkout feature.

However if we enter the same scenario, but put in invalid shipping information, we’re not on the happy path. This is certainly a scenario that should be tested for, and we should absolutely make sure our app correctly handles the situation where invalid shipping information is entered. But, the most important variant of that scenario to test is the one that we explicitly hope users will find themselves in. That’s the happy path.
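
To make the idea concrete, here is a minimal sketch of what a happy-path checkout test might look like with the Appium Python client. The server address, app path, and accessibility IDs are hypothetical placeholders; a real test would use your app's actual locators and test data:

```python
# A sketch of a "happy revenue path" test: standard, positive inputs only.
# (Account login and payment entry are omitted for brevity.)
from appium import webdriver
from appium.options.android import UiAutomator2Options
from appium.webdriver.common.appiumby import AppiumBy

options = UiAutomator2Options()
options.app = "/path/to/shop.apk"  # hypothetical app under test

driver = webdriver.Remote("http://127.0.0.1:4723", options=options)
try:
    driver.find_element(AppiumBy.ACCESSIBILITY_ID, "Add To Cart").click()
    driver.find_element(AppiumBy.ACCESSIBILITY_ID, "Checkout").click()
    driver.find_element(AppiumBy.ACCESSIBILITY_ID, "Shipping Address").send_keys("123 Main St")
    driver.find_element(AppiumBy.ACCESSIBILITY_ID, "Place Order").click()

    # The normal, designed-for conclusion of the flow.
    assert driver.find_element(AppiumBy.ACCESSIBILITY_ID, "Order Confirmation").is_displayed()
finally:
    driver.quit()
```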

Happy Path + Revenue Path

If we merge these two terms to describe the top right of this two-dimensional space, we might refer to it as the happy revenue path. My proposal is that as a product team, each member should possess some understanding, whether implicit or explicit, of how every test case they define relates to this space. Starting with the happy revenue path and expanding outward from there is the most efficient approach.

Why do we focus on this first? Because there are associated costs with writing and executing tests. Automating tests incurs expenses for our company, as does the time and resources required for test execution, including the wait time for developers and product teams to receive test results. Therefore, we aim for our tests to deliver maximum value from the outset, which entails identifying the intersection of value specific to our app as we commence testing.

Naturally, during the process of crafting our happy revenue path test, we may discover the need to create several additional tests. For instance, drafting a checkout test might entail having already written an account creation or login test, as these are prerequisites. However, the primary emphasis remains on the happy revenue path.

Diminishing Returns

Every Unit of Additional Investment Yields Less Return

So what’s the next step? Do we persist in writing tests indefinitely until we’ve addressed every single edge case and achieved flawless test coverage? In an ideal scenario, that would be the approach, but in reality, we frequently confront the challenging reality of diminishing returns. The principle of diminishing returns indicates that for each additional unit of investment, we receive slightly less in return. This principle applies to our investment in our test suite as well, particularly if we’ve followed the approach of prioritizing testing in the highest-value areas first.

When each additional test case costs more to write than the value it provides, it may be the time to stop writing tests.

This is actually a good thing, because it means at some point we will know when it’s a good idea to stop writing new test cases. When the value gained from each new test case diminishes compared to the time and resources invested in writing and/or running that test case, it indicates that perhaps our test suite doesn’t need to expand any further.

Edge cases might be completely valid, but also not worth testing without a proven instance of user activity.

But how can we determine whether a new test case won’t yield sufficient value? It’s not a straightforward question to address because it requires a judgment call considering both the costs and the benefits, which can be challenging to quantify. However, let’s consider a few examples. Suppose we have a text input box in our app, and as creative testers, we contemplate all the non-standard inputs users might enter into that box. We might consider the need for a test covering an input string with a space, one with an accented character, one that’s 50 characters long, or one that’s 100 characters long.

Edge case input can sometimes be combined to reduce the number of test cases.

Now, does the test involving the 100-character string truly provide additional insight compared to the test involving the 50-character string? While it’s theoretically plausible that the app might function correctly with one string length but not another, it appears to be a potential redundancy in testing. If the app is designed to manage arbitrary input of any length, why not utilize a single string that is exceptionally long, containing a space, special symbols, and an accent, all combined? Admittedly, this approach sacrifices some specificity in case of test failure, as we’re uncertain which attribute of our comprehensive string triggered the failure. However, considering the minimal likelihood of failure, it’s likely more efficient to conserve time in our test suite.
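
As a tiny sketch of that consolidation (all inputs here are invented):

```python
# Four separate candidate inputs for one text box...
separate_inputs = [
    "John Smith",    # contains a space
    "José",          # accented character
    "x" * 50,        # 50 characters long
    "x" * 100,       # 100 characters long
]

# ...collapsed into one "kitchen sink" input that exercises a space, an accent,
# special symbols, and a long length in a single run. The trade-off: if it
# fails, we won't immediately know which attribute was the culprit.
combined_input = "José Smith @#% " + "x" * 100
print(f"{len(separate_inputs)} cases collapse into 1 (length {len(combined_input)})")
```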

It’s possible to get negative returns from tests — they can cost more than they deliver. Test suite design requires careful thought.

Keep in mind that running tests incurs costs in terms of both time and money for our company. In fact, it’s conceivable to experience negative returns from tests. If our test suite becomes overly extensive and developers have to endure prolonged wait times for feedback, the tests become futile, and it would have been more advantageous to have fewer tests in a suite that completes in a shorter duration. The key point here is that we must employ critical thinking as we construct our test suites and have a clear understanding of which test cases will deliver the greatest value.

Appium/Selenium functional tests might not be the best way to validate every kind of case.

It’s also important to remember that there are different kinds of testing. Perhaps the use of Selenium or Appium to test 50 different edge cases is not the best idea. Selenium and Appium give incredibly high fidelity testing of user experiences, but they are slow. If what we want is merely to test the logic of the app with numerous kinds of input, then it sounds like a unit test for the internal functionality might be a better place to engage our creativity in thinking of edge cases.
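
For instance, a parametrized unit test over the app's input-handling logic can sweep through many edge cases in well under a second, where each one would cost a separate session through the UI. The validate_display_name function and its rules below are invented for illustration:

```python
import pytest

def validate_display_name(value: str) -> bool:
    # Hypothetical rule: non-empty after trimming, at most 200 characters.
    return 0 < len(value.strip()) <= 200

@pytest.mark.parametrize("value", [
    "John Smith",      # space
    "José",            # accented character
    "x" * 50,          # long
    "x" * 100,         # longer
    "  padded  ",      # surrounding whitespace
])
def test_display_name_is_accepted(value):
    # Dozens of these run in milliseconds, versus ~30 seconds each in Appium.
    assert validate_display_name(value)
```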

Serialized and Parallelized Testing

Another crucial factor to ponder in this context is time. As previously noted, when a test suite expands, its execution time may become so prolonged that we exhaust our time before needing to rerun it, or it may take so long that it fails to offer sufficient feedback. How might we address this challenge?

Testing in Serial

The first strategy to address this issue is to run tests concurrently rather than one after another. Consider first the scenario where we execute tests serially, in a sequence, within a test suite. With each new test added, the overall duration of our test suite grows a little longer.

But now imagine another picture, where rather than adding a test to the end of our test suite, we add a new execution thread to the suite.

Testing in Parallel

By an execution thread I mean the ability to run a test on a single platform or device. So when we talk about adding execution threads, we mean adding the ability to run multiple tests at the same time.

Imagine a world where each test in our test suite had its own execution thread: its own Appium or Selenium session, its own browser or device. Then they could all run at the same time. The entire test suite would take only as long as the longest test.

In practice, however, it’s usually difficult to get access to that many servers and devices, and it can also be difficult to manage running that many tests at the same time from the WebDriver client perspective as well. Still, we are going to want to run in as many parallel execution threads as we can.
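
As a minimal sketch of the idea using only the standard library (in practice, tools like pytest-xdist, Selenium Grid, or a cloud device farm usually handle the distribution), where run_suite_on is a hypothetical stand-in for opening a session on a device and running the tests against it:

```python
from concurrent.futures import ThreadPoolExecutor

DEVICES = ["iPhone 15", "iPhone 14", "Pixel 8", "Galaxy Tab S9"]  # hypothetical pool

def run_suite_on(device: str) -> str:
    # Placeholder: create an Appium/Selenium session bound to `device`,
    # run the test suite against it, then quit the session.
    return f"suite finished on {device}"

# Four execution threads: total wall-clock time approaches that of the slowest
# suite, instead of the sum of all four.
with ThreadPoolExecutor(max_workers=len(DEVICES)) as pool:
    for result in pool.map(run_suite_on, DEVICES):
        print(result)
```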

Breaking up the Suite

Break down one huge test suite into different test suites with different purposes.

Another standard way of solving the problem of really long test suites is to break the test suite up into different smaller suites, which get executed at different frequencies.

Division between Smoke and Regression tests.

One common distinction is between a set of Smoke tests and a set of Regression tests.

  • The regression suite would be the full test suite, and it would be run at some point before an actual release, but maybe not every day.
  • The smoke tests would be a much smaller suite that would run on every commit or every change proposed by the developers.

Regression is a full test suite designed to catch any bugs. Executed at least once before release.

What is the origin of these terms? In our industry, a regression essentially refers to a bug, particularly something that previously functioned properly but is now malfunctioning for various reasons. Regressions can occur due to numerous factors. Today’s applications are exceedingly complex, such that alterations in one area of an app can impact entirely different parts that developers may not have considered testing during development. However, an automated regression build, which executes all known test cases, would detect such issues before the app’s release.

Smoke is a much smaller suite that runs daily or on every app change.

Smoke tests earned their name because, supposedly, in the past, when electronic hardware was powered on and didn’t immediately emit smoke, it was considered to have passed the smoke test, implying it wasn’t instantly disqualified. We might also interpret smoke tests in alignment with the proverb, “where there’s smoke, there’s fire”. If a smoke build yields an error, it signals an immediate need for rectification. However, the absence of smoke doesn’t negate the possibility of fire, just as passing the smoke test suite doesn’t ensure full functionality — it’s the regression suite’s role to uncover any issues.

The art is picking the right tests for Smoke or Regression suites.

However, by selecting the appropriate tests for our smoke suite, we stand a good chance of receiving early indications regarding the app’s quality. This often suffices to prevent developers from introducing significant bugs before merging their code into the main development branch. To emphasize, a common approach to addressing the issue of overly extensive test suites is to segment them and execute them at different intervals.
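
Mechanically, the split can be as simple as tagging tests. Below is a sketch using pytest markers; the test names are hypothetical, and in a real project the custom markers would be registered in pytest.ini (or pyproject.toml) to avoid warnings:

```python
import pytest

@pytest.mark.smoke
@pytest.mark.regression
def test_checkout_happy_path():
    ...  # the happy revenue path belongs in both suites

@pytest.mark.regression
def test_checkout_with_invalid_shipping_address():
    ...  # a valid edge case, but regression-only
```

The smoke suite can then run on every change with `pytest -m smoke`, while `pytest -m regression` runs nightly or before a release.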

While the combination of smoke tests and regression tests is prevalent, it is not the sole system in use. Your particular team and company may opt for three or four levels of test suites, or even non-overlapping suites tailored to test specific features, or whatever aligns with your needs. Nevertheless, dividing the test suite entails increased complexity in managing the execution of the smaller suites and integrating them into the overall development pipeline.

Choosing Test Environments

The last area for us to consider is the area of platforms and devices. In a phrase, our test environments. Which environments should we pick to test on?

Ideally we would run tests in every possible environment to ensure total coverage. But there are just too many in practice.

Once more, in an ideal scenario, we would cover all the myriad environments where our app could potentially operate in the wild. A guarantee that every test passes on every browser, platform, and device would be valuable indeed! The practical issue lies in the sheer multitude of these environments, with many combinations proving difficult or impossible to procure.

Particularly concerning mobile devices, it is economically unfeasible to acquire and sustain multiple instances of each device model, each running the array of supported operating system versions. This problem is well known in the mobile development and testing realm by the name of fragmentation.

Source: Report Finds 24K Distinct Android Devices

Not every device / environment is equally valuable. We can do the environment ROI calculation when we know usage patterns: locations, devices, form factors, platform versions, etc. This helps us narrow down what’s important to test.

Fortunately, not every device holds equal significance within our test suite. We can view this issue as another form of return on investment assessment. It becomes crucial to possess insights into our users, ideally backed by ample usage data. Where do our users predominantly reside? Which devices and operating systems do they favor? Are they early adopters, or do they still utilize older Android versions? Examination of our usage data typically reveals that the majority of users gravitate toward a relatively limited range of devices and operating systems. This insight can guide our selection of devices or browsers for inclusion in our smoke and regression suites.

Even outside of tracked usage data, picking a few popular devices in our region is usually sufficient.

Outside of having this data, just picking a few popular devices in the geographic regions our app has traction in is a good way to start, and settling on the last one to two iOS and Android versions is usually sufficient. Then when it comes time to do a release, expanding our device coverage temporarily is useful to make sure we don’t have any surprises.
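
When usage data is available, the environment ROI calculation can be as simple as sorting by share and keeping the smallest set of environments that covers most users. The devices and percentages below are made up for illustration:

```python
# Hypothetical share of active users per (device, OS version) combination;
# the remaining ~8% is a long tail of rarer environments.
usage_share = {
    ("iPhone 15", "iOS 17"):         0.30,
    ("Pixel 8", "Android 14"):       0.22,
    ("iPhone 14", "iOS 17"):         0.18,
    ("Galaxy S23", "Android 14"):    0.12,
    ("Galaxy Tab S9", "Android 13"): 0.06,
    ("Pixel 6", "Android 13"):       0.04,
}

# Keep adding the most popular environments until ~80% of users are covered.
chosen, covered = [], 0.0
for env, share in sorted(usage_share.items(), key=lambda kv: kv[1], reverse=True):
    if covered >= 0.80:
        break
    chosen.append(env)
    covered += share

print(chosen)             # the 4 environments worth putting in the smoke suite
print(round(covered, 2))  # 0.82
```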

With all these considerations, there are no hard and fast rules. Make decisions, gain some experience, and be quick to evaluate and reconsider based on the success of your test program.

I hope you now understand the complexity involved in making decisions about what to test and what to skip, considering various factors. While I’ve provided explanations and examples to assist you, the best approach will always depend on the specific situation. There’s no replacement for practical experience, trying out different methods, observing their outcomes, and adjusting strategies accordingly, while drawing from the ideas discussed here.

Happy testing and debugging!

I welcome any comments and contributions to the subject. Connect with me on LinkedIn, X, GitHub, or IG.

