Statistical magic spells to automate performance test result analysis.
Don't we all want the ability to move things with our mind?

During the 2022 holiday season, my friends and I decided to have a nineties-themed movie night. Armed with a lot of popcorn, we watched some "older" films. To start the night off, we chose some of our favorite childhood Christmas-themed movies, such as "Home Alone" and "Die Hard." As the evening progressed, we wanted to finish with one of our all-time favorites, "Matilda" (1996).

While watching the scene where she discovers her psychokinetic abilities, I couldn't help but daydream about the convenience of having such powers. Imagine being able to eat cereal without using your hands, tidy up the room, and watch your favorite TV show all at once.

This led me to consider automation in the context of performance engineering, specifically the task of analyzing results. I've observed that one of the major challenges in achieving reliable, fully automated performance testing is automating the manual analysis of test results. Often, the examination of these results still requires manual intervention, which hinders the fast pace needed to keep up with the increasing demands of the online world for faster, automated delivery.

In this article, I aim to provide a step-by-step guide to help you automate the task of performance test result analysis for use within a CI/CD pipeline, giving you the power to perform this task with ease, just like eating cereal hands-free. By the end of this article, you will have the skills to analyze performance tests automatically with the help of statistical methods.

Code examples and additional resources on this topic can be found on my GitHub. I would like to extend a big thanks to Mark Dawson, Jr. from JabPerf Corp for his valuable feedback and review.

Crap in, crap out, don't mess with the delicate machinery.

Data matters in any algorithm that makes smart decisions based on measurements. Whatever you decide to feed into these decision-making machines needs to have the proper structure and type. It has the same real-world effect as feeding petrol into a diesel car and acting surprised when your car starts acting "funny".

When your sample is only an array of averages, or is otherwise highly aggregated, then in most cases the data in your sample is useless and the statistical machinery will start to sputter and output gibberish. A visual example of how easily a sample based only on averages can fool you, or an algorithm that expects raw measurements, can be seen below:

[Animation: the same performance test result, shown once as a scatter plot of the raw measurements and once as an average line graph.]

When viewed in the animation above, the first graph appears to showcase a certain level of stability within the system under test. However, upon closer inspection of the raw data of this test, discrepancies and patterns emerge that may suggest underlying problems requiring further investigation.

The main point to draw from this is that averaging or aggregating samples can be deceptive, leading to a biased analysis and wrong conclusions if the underlying raw data has not been reviewed.

There are exceptions where data aggregation may be vital, such as with extremely large data sets. Even in those cases, it is important to retain a significant number of data points for an accurate statistical analysis, because aggregation and averaging can skew the results.

Demystifying the wonderful magic behind the statistics.

When comparing two sets of measurement data from two separate load tests, how would you approach the comparison? Your mind might go back to the nightmares of university stats classes and remember Student's t-test. However, a t-test assumes both samples follow a Gaussian (normal) distribution, which response times typically do not: response time distributions are usually long-tailed, right-skewed (positive skew), and multi-modal.


We need a statistical test or metric that can compare non-normal distributions; these are called non-parametric statistical tests. Two examples of such tests are the Kolmogorov-Smirnov Distance and the Wasserstein Distance. Both provide a distance metric that represents the amount of statistical distance between two samples.

So what is a statistical distance? In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two sets of measurements that have been structured into a "statistical object". This can be anything from two random variables, a population, a wider sample of points, or two probability distributions such as an Empirical Cumulative Distribution Function (ECDF).

Empirical Cumulative Distri-what? What is an ECDF?

The Empirical Cumulative Distribution Function describes the cumulative probability distribution of a sample. In simpler words, within our context, it shows how likely it is that the system produces a response time at or below a certain value.

[Image: an ECDF represented as a graph.]

The ECDF chart shown above provides a basic example of this concept at work. It estimates the Cumulative Distribution Function by plotting the measurements from the smallest to the largest value, which helps to visualize the distribution across the entire dataset.

Pre-processing before raw data is used in your delicate statistical machinery.

It is advisable to perform pre-processing on your raw data to improve its compatibility with the statistical machinery you will use. Begin by removing outliers, as they can significantly affect the overall understanding of the data. Outliers are a small, exceptional set of measurements that can skew the big picture.

It's fair to analyze them separately from the main sample after the initial automated analysis through a simpler comparison method. Typically, I eliminate everything above the 95th percentile, but you can choose any percentile that you prefer. The separation can be done in the following manner:

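A minimal sketch of this separation in Python could look like the snippet below (the split_outliers helper and the example numbers are purely illustrative; the full code I use lives on my GitHub):

```python
import numpy as np

def split_outliers(raw_response_times, percentile=95):
    """Split a sample into a main body and the outliers above the given percentile."""
    measurements = np.asarray(raw_response_times, dtype=float)
    threshold = np.percentile(measurements, percentile)
    return measurements[measurements <= threshold], measurements[measurements > threshold]

# Everything above the 95th percentile is set aside for a separate, simpler analysis.
main_sample, outliers = split_outliers([120, 130, 128, 135, 900, 125, 140])
```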

I also like to standardize my numbers to a common scale. To achieve this, I often apply normalization using the min-max method. This step is optional as there are cases where this is not necessary, but I prefer to normalize the data to ensure that every analysis is conducted within the same scale. To normalize the data, I use the outlier-filtered sample as input for a min-max normalization function. An example of such a function is shown below, if you wish to perform normalization as well:

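A sketch of such a min-max normalization function is shown below (again illustrative; it simply rescales the outlier-filtered sample into the 0-1 range):

```python
import numpy as np

def min_max_normalize(sample):
    """Rescale all measurements into the 0-1 range using min-max normalization."""
    measurements = np.asarray(sample, dtype=float)
    minimum, maximum = measurements.min(), measurements.max()
    if maximum == minimum:
        # A completely flat sample would otherwise cause a division by zero.
        return np.zeros_like(measurements)
    return (measurements - minimum) / (maximum - minimum)

# In practice, feed in the outlier-filtered sample from the previous step.
normalized_sample = min_max_normalize([120, 130, 128, 135, 125, 140])
```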

Once the pre-processing of the raw data is complete, we can proceed to calculate and fit the data into an ECDF. This can be achieved through the following example:

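A simple way to compute an ECDF from a pre-processed sample is sketched below: it sorts the measurements and pairs each one with its cumulative probability.

```python
import numpy as np

def empirical_cdf(sample):
    """Return the sorted measurements and their cumulative probabilities (the ECDF)."""
    measurements = np.sort(np.asarray(sample, dtype=float))
    probabilities = np.arange(1, len(measurements) + 1) / len(measurements)
    return measurements, probabilities

# Example: fit an ECDF on a normalized sample of response times.
x_values, y_values = empirical_cdf([0.12, 0.34, 0.35, 0.47, 0.61, 0.98])
```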

After calculating the data, your sample should have a structure similar to the ECDF example we looked at earlier. The beauty of these statistical objects is that you can choose the size of your sample, giving you complete control to create a rolling baseline mechanism or static baseline, for example.

Typically, you will have only one benchmark sample to compare, but you will usually have several stable baseline samples available. As the person who creates the samples, you can choose which specific datasets go into the baseline and benchmark ECDFs, regardless of their sample size.

Now, with two ECDFs, one for the baseline and one for the benchmark, the real work begins. To analyze these, we need to understand two important distance statistics that we can calculate over an ECDF: the Wasserstein Distance and the Kolmogorov-Smirnov Distance, which we will briefly explore.

Roll up your sleeves and get the shovel, it is earth-moving time.

The Wasserstein Distance, also known as the Earth Mover’s distance, can be challenging to understand due to its formal complexity. However, it can be easier to grasp by using its physical real-world interpretation.

Imagine you have two piles of dirt, one larger than the other. Your boss asks you to make both piles the same size. The amount of dirt you need to move from one pile to the other to make them equal is the Wasserstein Distance. This distance measures the amount of "work" required to transform one pile into the other.

The Wasserstein distance provides a measure of the difference between two probability distributions in terms of the amount of "work" needed to transform one distribution into the other. The distance is calculated by determining the area between the two ECDFs.

The larger the area, the greater the difference between the two distributions and the larger the Wasserstein distance will be. It is a useful metric for performance analysis because it can be used to compare two different test results and provide a measure of how different they are.

[Image: the Wasserstein Distance shown graphically as the orange area between the red and blue lines.]

The formal equation to calculate the Wasserstein Distance is quite complex and it reads as follows:

[Image: the formal Wasserstein Distance equation. Straight out of The Lord of the Rings.]
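
For the one-dimensional samples we work with here, however, it reduces to something much friendlier: the area between the two cumulative distribution functions F and G,

W_1(F, G) = \int_{-\infty}^{\infty} \lvert F(x) - G(x) \rvert \, dx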

Here is example code to automate the calculation of the Wasserstein Distance:

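A minimal example using SciPy's built-in wasserstein_distance function could look like this (the two samples are illustrative; the full implementation is on my GitHub):

```python
from scipy.stats import wasserstein_distance

# Pre-processed baseline and benchmark samples (illustrative values).
baseline_sample = [0.12, 0.34, 0.35, 0.47, 0.61, 0.98]
benchmark_sample = [0.15, 0.36, 0.41, 0.52, 0.69, 0.99]

distance = wasserstein_distance(baseline_sample, benchmark_sample)
print(f"Wasserstein Distance: {distance:.3f}")
```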

The mathematical formula behind the Wasserstein Distance is not essential to understand; what matters is comprehending the result it produces. This number signifies the amount of effort needed to convert one performance test result set into another, taking the entire test result data set into account.

An increase in this number indicates a larger area between the two ECDFs, which can result from natural fluctuations in application, network, and database latency, or from a regression introduced by a defect. Because of those natural fluctuations, the Wasserstein distance will in practice never be exactly zero, but it should be as close to zero as possible to reflect a stable application.

The Wasserstein Distance therefore captures the overall stability of the application. However, if the difference between the two ECDFs is concentrated in one particular part of the graph, this measurement might not detect a significant change and could write it off as a normal swing in the application's latency behavior. To prevent this from happening, we can use the Kolmogorov-Smirnov Distance.

Busting out the Kolmogorov-Smirnov Distance to combat spikes.

The Kolmogorov-Smirnov Distance is the distance metric produced by the two-sample Kolmogorov-Smirnov hypothesis test. It represents the largest absolute difference between two cumulative distribution functions (CDFs). This is a valuable metric for performance engineers, as it quantifies the maximum distance between two tests.

The Kolmogorov-Smirnov distance helps determine whether the difference between two tests is within a normal range or has changed too much to be considered normal. This makes it an effective tool for identifying unusual differences between test results in the form of spikes.

[Image: two ECDFs, shown as a red and a blue line, with a Kolmogorov-Smirnov Distance of 0.207 at the point of maximum difference.]

In the graph above, the Kolmogorov-Smirnov Distance was calculated by comparing two distributions, represented by the red and blue lines. The resulting value of 0.207 is the maximum difference between the two cumulative distribution functions. The formula to determine the Kolmogorov-Smirnov Distance reads as follows:

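In standard notation, for two empirical distribution functions F_1 and F_2, it is the largest vertical gap between them:

D = \sup_x \lvert F_1(x) - F_2(x) \rvert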

It is possible to automate this calculation with code like the example below; the full version is available in my repository:

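A minimal example using SciPy's two-sample ks_2samp function is sketched below; the statistic attribute of the result is the Kolmogorov-Smirnov Distance we are after (the samples are illustrative):

```python
from scipy.stats import ks_2samp

# Pre-processed baseline and benchmark samples (illustrative values).
baseline_sample = [0.12, 0.34, 0.35, 0.47, 0.61, 0.98]
benchmark_sample = [0.15, 0.36, 0.41, 0.52, 0.69, 0.99]

result = ks_2samp(baseline_sample, benchmark_sample)
print(f"Kolmogorov-Smirnov Distance: {result.statistic:.3f}")
```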

A powerful wombo combo to start automating manual analysis.

Familiarity with the Wasserstein and Kolmogorov-Smirnov Distances as metrics for measuring differences between two sets of test results is a key factor in automating a comparison. I believe these two distance metrics complement each other, each picking up differences where the other is weaker. With this statistical understanding, we can develop an algorithm that processes these metrics and estimates the difference between the benchmark results and the established baseline.


The solution we are creating may seem to invite trendy buzzwords like AI or machine learning, but I would argue they do not apply. Instead, our focus is on using robust statistical methods to measure change and then feeding those calculations into a heuristic that can make sound decisions about the outcome. This is a more precise, statistically driven approach to determining whether the impact of a change was good or bad.

To be able to differentiate between good and bad, a series of critical values is necessary to categorize test results from poor to excellent. This requires a significant amount of performance test data, both from the application performing well and from it performing poorly. With this data, calculations and manual verification can be performed to establish what is considered good or bad performance. (If you have not been consistently storing your performance test results, I would suggest reading an earlier article I wrote that outlines my perspective and philosophy on this matter.)

Establishing or modifying these critical values may require experimenting with stable performance tests. I have established default thresholds within my project that you can use as a starting point. However, keep in mind that these thresholds are based on the data I had available, so it is advisable to verify the classifications you wish to apply to certain ranges or critical values.

This experiment involves gradually introducing changes to two stable performance tests, allowing you to identify values that may be considered excessive. For instance, if two identical samples are subjected to an experiment where the benchmark performance is artificially degraded while the baseline sample remains stable, the results will reflect this decline in performance. The animation below illustrates this experiment:

[Animation: the benchmark sample is gradually degraded while the baseline sample remains stable.]

In the animation, the top right corner above the legend displays the amount of distance introduced as a percentage, which is then distributed across the entire dataset. The bottom of the animation shows the increasing Wasserstein and Kolmogorov-Smirnov Distances for reference. It is entirely up to you to determine which distance values should be assigned to each category and how much regression you are willing to tolerate. The categorization is based on what you or your organisation would consider too much difference between the lines.
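
If you want to run this kind of experiment yourself, a rough sketch could look like the snippet below: an artificial slowdown is applied to a copy of a stable sample and both distances are recomputed at every step (the generated sample and the percentages are purely illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

# A generated stand-in for a stable, pre-processed response time sample.
rng = np.random.default_rng(seed=42)
baseline = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Gradually degrade a copy of the baseline and watch both distances grow.
for degradation in [0.00, 0.05, 0.10, 0.20, 0.40]:
    benchmark = baseline * (1 + degradation)
    wd = wasserstein_distance(baseline, benchmark)
    ks = ks_2samp(baseline, benchmark).statistic
    print(f"{degradation:.0%} degradation -> Wasserstein: {wd:.3f}, Kolmogorov-Smirnov: {ks:.3f}")
```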

Ranking our newly discovered distance statistics.

With the information obtained from the experiments and manual analysis, we can determine our own threshold for what is considered too much distance between two performance tests. This information can then be used to create a table of critical values, which can be categorized using a ranking system such as Japanese letter grades (S to F) or a score from 0 to 100, similar to what you would find in video games.

By using these critical values in a heuristic, we can estimate a letter grade and use it to determine the success or failure of a pipeline. A sample representation of these critical values and possible per-category actions is shown below:

[Table: example critical values for the Wasserstein and Kolmogorov-Smirnov Distances per rank (S to F), with a possible action per category.]
When reading this table, keep in mind that both statistics need to fall in the same category; if this is not the case, the lowest category is selected.
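
A heuristic that applies this rule can be as simple as the sketch below. The critical values here are illustrative placeholders only; my actual defaults live in my GitHub repository and should be tuned to your own data:

```python
# Illustrative critical values (upper bounds per rank); NOT real defaults.
RANK_LETTERS = ["S", "A", "B", "C", "D", "F"]
WASSERSTEIN_LIMITS = [0.030, 0.060, 0.100, 0.150, 0.250, float("inf")]
KS_LIMITS = [0.040, 0.080, 0.120, 0.180, 0.280, float("inf")]

def classify(value, limits):
    """Index of the first rank whose critical value the measured distance does not exceed."""
    return next(index for index, limit in enumerate(limits) if value <= limit)

def rank_test_result(wasserstein, kolmogorov_smirnov):
    """Both statistics must land in the same rank; otherwise the worst of the two wins."""
    worst = max(classify(wasserstein, WASSERSTEIN_LIMITS), classify(kolmogorov_smirnov, KS_LIMITS))
    return RANK_LETTERS[worst]

print(rank_test_result(wasserstein=0.045, kolmogorov_smirnov=0.150))  # -> "C"
```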

The ranks should be determined based on what is considered an acceptable or unacceptable level of performance degradation for the particular application. Which ranks may be released automatically is a business decision, and ultimately depends on the organization's specific risk tolerance regarding performance and stability.

You can view the animation below to understand how much the curves need to change to result in a change of rank according to our heuristic.

[Animation: how much the curves need to shift before the heuristic assigns a different rank.]

Ranks are an effective way of categorizing test results and making the performance impact understandable to others, but they can also create blind spots in automated analysis as they do permit some level of regression.

To overcome this issue, a stricter threshold such as a score is a better solution, as it allows for a more precise determination of the amount of regression between test results. An example of scoring test results can be seen in the animation below, where the benchmark test is gradually degraded while the identical baseline test remains stable.

[Animation: the benchmark test is gradually degraded against a stable, identical baseline while the score drops.]
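
One simple way to turn a measured distance into such a score, and not necessarily the exact formula I use in my repository, is to map it linearly onto a 0-100 scale between zero distance and the largest distance you are still willing to accept:

```python
def score(distance, worst_acceptable_distance):
    """Map a distance onto a 0-100 score: 100 means identical, 0 means at (or past) the limit."""
    ratio = min(distance / worst_acceptable_distance, 1.0)
    return round((1.0 - ratio) * 100)

# Example: with 0.25 as the worst acceptable Wasserstein distance, a measured 0.05 scores 80.
print(score(distance=0.05, worst_acceptable_distance=0.25))  # -> 80
```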

Summarizing and concluding this longer-than-usual article.

In conclusion, I want to express my gratitude for taking the time to read my article. I hope to have convinced you of the power of statistics and how it can be applied in performance engineering. By properly implementing this method, we can automate the repetitive task of performance test result analysis and make it more reliable, structured on solid statistics.

The concept behind this analysis is simple once you understand the statistics and have experimented with it. Additionally, this analysis can be combined with other checks such as detecting outliers, error rates, test duration, and throughput.

Automating performance test result analysis has a lot of potential and can free up performance engineers to focus on more pressing issues, while supporting DevOps teams in interpreting these complex results on their own.

If you found this article interesting, you can learn more by checking out my GitHub repository. If you have any questions or want to contribute, please don't hesitate to reach out to me, I am always eager to help expand this topic.
