
The cost of automated end-to-end testing

Creation vs Execution vs Maintenance

John Gluck
7 min read · Jun 10, 2023


Writing tests is the cheap and easy part

Yes, I know that people who have spent hours trying to make tests work across browsers and devices will disagree with this statement, so let’s set cross-browser and device testing aside for now. Suffice it to say that most teams over-test in this area, and there are great tools, like Mabl, that have achieved the holy grail we were all striving for: write once, run on every browser.

Apart from working through the initial technical complications (connecting to your database, sorting out network configuration, and so on), creating tests is generally easy and getting easier every day.

Maintenance, however, is extremely costly. Even so, both testers and developers are strongly averse to rewriting tests, even tests that no one really understands. Once a test exists, it is hard to get rid of unless you have the mindset that tests are expendable, and tests are only expendable if you understand what they cover. Very few teams understand their test coverage, and measuring code coverage for an application during an end-to-end run is itself a complicated problem to solve.

Cost of Execution over time

Let’s say you write a test in 4 hours, which would likely be a pretty complicated test. If your tester makes $50 per hour, the test costs $200 to build.

Let’s say each run of your test costs 12 cents, based on the average discounted per-hour price of an AWS EC2 instance (assuming the instance is billed for the hour whether or not the test uses all of it). That means your test needs to execute roughly 1,666 times ($200 ÷ $0.12) for execution cost to equal creation cost. If your test runs just once a day, it would take about 4.5 years for the execution cost to catch up with the creation cost. But no test runs just once per day and stays running smoothly. Tests decay quickly.
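The break-even arithmetic above can be checked in a few lines. This is a sketch using the article's assumed figures ($50/hour tester, ~12 cents of EC2 time per run), not measured costs:

```python
# Back-of-the-envelope break-even, using the article's assumed figures.
creation_cost = 4 * 50   # 4 hours of work at $50/hour = $200
cost_per_run = 0.12      # ~12 cents of discounted EC2 time per run

runs_to_break_even = creation_cost / cost_per_run
print(round(runs_to_break_even))           # 1667 (the article's ~1,666)

# At one run per day, how long until execution cost equals creation cost?
years_at_one_run_per_day = runs_to_break_even / 365
print(round(years_at_one_run_per_day, 1))  # 4.6 years
```

Plug in your own hourly rate and per-run cost to see how the break-even point moves for your team.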

So let’s say your tests run at least once per day in addition to the daily (more likely nightly) run. This is not an unreasonable expectation if your team still releases every week or every other week, given that most teams operating that way have a few days of crunch time when they run all their tests several times, maybe even twice or more daily.

If we double that execution rate, it now takes about 2.25 years for execution cost to catch up with creation cost. Let’s say that’s reasonable for an API test. Selenium-based UI tests, on the other hand, take 30 seconds simply to bootstrap themselves, so let’s say your UI test takes 60 seconds end to end. Rounding up a bit, execution cost now catches up with creation cost in a little over a year. And the longer your test takes to run, the sooner you catch up.
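Doubling the run rate halves the catch-up time, which is easy to confirm with the same assumed figures as before:

```python
# Same assumed figures as above: $200 to create, ~12 cents per run.
runs_to_break_even = 200 / 0.12

# Two runs per day instead of one halves the time to break even.
years_at_two_runs_per_day = runs_to_break_even / (2 * 365)
print(round(years_at_two_runs_per_day, 2))  # 2.28 years
```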

The cost of maintenance

Maintenance cost can vary widely, but I think it’s fair to say that most automators spend at least 8 hours per week on maintenance, which is the cost of building two of the aforementioned tests. If a tester owns about 100 tests, a reasonable number of UI tests for a senior tester to own, they are spending just under 5 minutes per test per week, or about 20 minutes per test per month. At this rate, it takes about a year for maintenance cost to equal the cost of creation.
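The per-test maintenance math works out like this, again using the article's assumed workload (8 hours/week of maintenance spread across 100 tests, each built in 4 hours):

```python
# Maintenance arithmetic from the paragraph above (assumed figures).
maintenance_hours_per_week = 8
tests_owned = 100
creation_hours = 4

minutes_per_test_per_week = maintenance_hours_per_week * 60 / tests_owned
print(minutes_per_test_per_week)    # 4.8 minutes per test per week

# How many weeks until maintenance time equals the 4 hours of creation?
weeks_to_equal_creation = creation_hours * 60 / minutes_per_test_per_week
print(weeks_to_equal_creation)      # 50.0 weeks, roughly a year
```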

If we add the cost of execution to the cost of maintenance (rounding down slightly on execution), it now takes about 6 months for creation cost to equal execution plus maintenance. What I’m really saying here is that in the very best case scenario, where your tests are super fast and super reliable, you could rewrite all your tests every 6 months and break even.
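Combining the two costs is a rate calculation: if execution alone breaks even in about a year (the UI-test case above) and maintenance alone breaks even in about a year, together they burn the creation budget twice as fast. A minimal sketch, assuming roughly one year for each:

```python
# Combined break-even: costs accrue in parallel, so rates add.
exec_break_even_years = 1.0   # execution alone (~1 year, UI-test case)
maint_break_even_years = 1.0  # maintenance alone (~1 year)

combined_years = 1 / (1 / exec_break_even_years + 1 / maint_break_even_years)
print(combined_years)  # 0.5, i.e. about 6 months
```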

Enough with the optimism

So let’s talk about reality. Unless you build your tests on a modern framework like Cypress or Playwright, your tests might average 3 minutes apiece. Now we’re breaking even at 3 months on maintenance cost alone. And we’re still running only twice a day, which is probably not realistic. Note: this is without parallelization.

In reality, very few teams are so optimized that they run on the cheapest instances, as quickly as possible, with hardly more than a modicum of maintenance. Most teams could probably break even with a complete test rewrite every 1–3 months.

But teams don’t rewrite tests this often, because many teams believe that a flaky test is more valuable than no test at all. And so they keep executing those tests to preserve some semblance of coverage.

The human side effects of flaky tests

Using the example above, if your team executes a test with a 90% pass rate 3 times a day, it will fail or err roughly once every three days. Some teams will simply run that test again, assuming the failure is a false positive. Once a team has 10 or more such tests, tracking the false positives gets difficult, so maybe your team creates a system to track flaky tests. Then your team starts depending on that tracking system. Depending on how much time you sink into this effort, one of two things happens:

  1. You spend so much time tracking false positives that you hit the break-even point even sooner than calculated above.
  2. You dismiss a genuine failure as a false positive, the bug escapes into production, and you, your team, and (depending on the severity) people outside your team spend hours in meeting after meeting discussing the escape. That time would have been better spent disabling and rewriting the flaky test. Your company may also have lost a lot of money on that escape.
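The interruption frequency quoted above follows directly from the pass rate and run cadence:

```python
# How often a "90% pass rate" test interrupts the team.
pass_rate = 0.90
runs_per_day = 3

expected_failures_per_day = runs_per_day * (1 - pass_rate)
days_between_failures = 1 / expected_failures_per_day
print(round(days_between_failures, 1))  # 3.3, i.e. a failure every ~3 days
```

With ten such tests, that becomes roughly three spurious failures a day, which is why teams end up building tracking systems around them.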

Don’t just take my word for it

Here’s a calculator from QA Wolf that looks at this problem a different way and comes up with slightly more extreme numbers: a test costs 2 hours to build (for a contractor), and each failure takes 1 billed hour to fix. By that math, if your test fails twice, you may as well rewrite it from scratch.

However, sometimes multiple tests fail for the same reason. Personally, I think that’s a smell: maybe you have assertions in library code, or you need more encapsulation, or you are asserting before the end of the test. Regardless, suppose several tests fail with the same error and are all repaired by a single fix, such that a test has to fail six times before fix time equals creation time. Remember, a test that runs 3 times a day with a 90% pass rate fails about twice every six days, so your team can run these particular tests for about three weeks before you reach the point where you are spending more time maintaining than writing.
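Under the shared-fix scenario, the numbers look like this. The assumption here is that one billed hour of fixing repairs three failing tests at once, so the effective cost per failure is a third of QA Wolf's figure:

```python
# QA Wolf-style figures with a shared fix (assumed: one 1-hour fix
# repairs 3 tests that failed for the same reason).
build_hours = 2
fix_hours_per_failure = 1 / 3

failures_to_equal_build = build_hours / fix_hours_per_failure
print(round(failures_to_equal_build))  # 6 failures to equal build time

# At 3 runs/day and a 90% pass rate, how long until 6 failures accrue?
failures_per_day = 3 * (1 - 0.90)
print(round(failures_to_equal_build / failures_per_day))  # 20 days
```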

But images speak louder than words. This is from xkcd.

Two main psychological phenomena explain why we don’t rewrite our tests.

Sunk Cost Fallacy

As we have seen, we don’t understand how much our tests cost to maintain and execute versus what they cost to create. I have shown you how to calculate those costs, and I encourage you to do so. But even with that knowledge, teams may push back on rewriting tests that are already running, regardless of pass rate, because of all the effort that went into creating the original set of tests.

Risk Compensation Bias

We are hesitant to rewrite end-to-end tests because they give us the illusion of safety. Maybe we inherited a test along with a set of others, don’t really understand how they work or what they do, and think we don’t have time to figure that out. Maybe we remember having trouble writing a given test because the application had testability issues, or the environment was having problems, or some business rule prevented us from putting data into an error state.

If you want to follow my advice and begin rewriting your tests because you see the negative ROI, I encourage you to walk your team through these errors in reasoning.

It is my opinion that testers don’t advocate for testability enough, and this is particularly true for UI end-to-end tests. Testers want to prove their value, so they try to work around problems rather than getting help addressing the root cause of their frustrations: the application or the environment. Consequently, their tests become littered with sleeps, retries, convoluted XPaths, and so on. Such tests become maintenance magnets.

I’m not saying writing automated tests is easy. What I’m saying is that writing the test is less resource-intensive than executing and maintaining it over time. We have a tendency to overestimate the ROI of any given automated test because of the confidence it gives us, even when that confidence is unwarranted.


John Gluck

Quality Practice Lead/TestOps Architect, Dad, Husband, blogger, cat herder, dark debt slayer, enjoyer of strange music and art, yoga enthusiast