Statistical magic spells to automate performance test result analysis.
Don't we all want the ability to move things with our mind?

During the 2022 holiday season, my friends and I decided to have a nineties-themed movie night. Armed with a lot of popcorn, we watched some "older" films. To start the night off, we chose some of our favorite childhood Christmas-themed movies, such as "Home Alone" and "Die Hard." As the evening progressed, we wanted to finish with one of our all-time favorites, "Matilda" (1996).

While watching the scene where she discovers her psychokinetic abilities, I couldn't help but daydream about the convenience of having such powers. Imagine being able to eat cereal without using your hands, tidy up the room, and watch your favorite TV show all at once.

This led me to consider automation in the context of performance engineering, specifically the task of analyzing results. I've observed that one of the major challenges in achieving reliable, fully automated performance testing is automating the manual analysis of test results. Often, the examination of these results still requires manual intervention, which hinders the fast pace needed to keep up with the increasing demands of the online world for faster, automated delivery.

In this article, I aim to provide a step-by-step guide to help you automate the task of performance test result analysis for use within a CI/CD pipeline, giving you the power to perform this task with ease, just like eating cereal hands-free. By the end of this article, you will have the skills to analyze performance tests automatically with the help of statistical methods.

Code examples and additional resources on this topic can be found on my GitHub. I would like to extend a big thanks to Mark Dawson, Jr. from JabPerf Corp for his valuable feedback and review.

Crap in, crap out, don't mess with the delicate machinery.

Data matters in any algorithm that makes smart decisions based on measurements. Whatever you decide to feed into these decision-making machines needs to have the proper structure and type. It has the same real-world effect as feeding petrol into a diesel car and acting surprised when your car starts acting "funny".

When your sample is only an array of averages, or is otherwise highly aggregated, then in most cases the data in your sample is useless and the statistical machinery will start to sputter and output gibberish. A visual example of how easily a sample based only on averages can fool you, or an algorithm that expects raw measurements, can be seen below:

[Animation: the same performance test result, shown once as a scatter plot of the raw measurements and once as an average line graph.]

When viewed in the animation above, the first graph appears to showcase a certain level of stability within the system under test. However, upon closer inspection of the raw data of this test, discrepancies and patterns emerge that may suggest underlying problems requiring further investigation.

The main point to draw from this is that averaging or aggregating samples can be deceptive, leading to a biased analysis and wrong conclusions if the underlying raw data has not been reviewed.

There are exceptions where data aggregation may be vital, such as with extremely large data sets. Even in those cases, it is important to retain a significant number of data points for an accurate statistical analysis, because aggregation and averaging can skew the results.

Demystifying the wonderful magic behind the statistics.

When comparing two sets of measurement data from two separate load tests, how would you approach the comparison? Your mind might go back to the nightmares of university stats classes and remember Student's t-test. However, a t-test assumes both samples follow a Gaussian (normal) distribution, which response times typically do not: response time distributions are usually long-tailed, right-skewed (positive skew), and multi-modal.


We need a statistical test or metric that can compare non-normal distributions; these are called non-parametric statistical tests. Two examples of such tests are the Kolmogorov-Smirnov Distance and the Wasserstein Distance. Both provide a distance metric that represents the amount of statistical distance between two samples.

So what is a statistical distance? In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two sets of measurements that have been structured into a "statistical object". This can be anything from two random variables, a population, a wider sample of points, or two probability distributions such as an Empirical Cumulative Distribution Function (ECDF).

Empirical Cumulative Distri-what? What is an ECDF?

The Empirical Cumulative Distribution Function describes the cumulative probability distribution of a sample. In simpler words, within our context, it shows how likely it is that the system produces a response time at or below a certain value.

[Image: an ECDF represented as a graph.]

The ECDF chart shown above provides a basic example of this concept at work. It estimates the Cumulative Distribution Function by plotting the measurements from the smallest to the largest value, which helps to visualize the distribution across the entire dataset.

Pre-processing before raw data is used in your delicate statistical machinery.

It is advisable to perform pre-processing on your raw data to improve its compatibility with the statistical machinery you will use. Begin by removing outliers, as they can significantly affect the overall understanding of the data. Outliers are a small, exceptional set of measurements that can skew the big picture.

It's fair to analyze them separately from the main sample after the initial automated analysis through a simpler comparison method. Typically, I eliminate everything above the 95th percentile, but you can choose any percentile that you prefer. The separation can be done in the following manner:

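A minimal sketch of this separation in Python could look like the snippet below (the split_outliers helper and the example numbers are purely illustrative; the full code I use lives on my GitHub):

```python
import numpy as np

def split_outliers(raw_response_times, percentile=95):
    """Split a sample into a main body and the outliers above the given percentile."""
    measurements = np.asarray(raw_response_times, dtype=float)
    threshold = np.percentile(measurements, percentile)
    return measurements[measurements <= threshold], measurements[measurements > threshold]

# Everything above the 95th percentile is set aside for a separate, simpler analysis.
main_sample, outliers = split_outliers([120, 130, 128, 135, 900, 125, 140])
```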

I also like to standardize my numbers to a common scale. To achieve this, I often apply normalization using the min-max method. This step is optional as there are cases where this is not necessary, but I prefer to normalize the data to ensure that every analysis is conducted within the same scale. To normalize the data, I use the outlier-filtered sample as input for a min-max normalization function. An example of such a function is shown below, if you wish to perform normalization as well:

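A sketch of such a min-max normalization function is shown below (again illustrative; it simply rescales the outlier-filtered sample into the 0-1 range):

```python
import numpy as np

def min_max_normalize(sample):
    """Rescale all measurements into the 0-1 range using min-max normalization."""
    measurements = np.asarray(sample, dtype=float)
    minimum, maximum = measurements.min(), measurements.max()
    if maximum == minimum:
        # A completely flat sample would otherwise cause a division by zero.
        return np.zeros_like(measurements)
    return (measurements - minimum) / (maximum - minimum)

# In practice, feed in the outlier-filtered sample from the previous step.
normalized_sample = min_max_normalize([120, 130, 128, 135, 125, 140])
```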

Once the pre-processing of the raw data is complete, we can proceed to calculate and fit the data into an ECDF. This can be achieved through the following example:

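A simple way to compute an ECDF from a pre-processed sample is sketched below: it sorts the measurements and pairs each one with its cumulative probability.

```python
import numpy as np

def empirical_cdf(sample):
    """Return the sorted measurements and their cumulative probabilities (the ECDF)."""
    measurements = np.sort(np.asarray(sample, dtype=float))
    probabilities = np.arange(1, len(measurements) + 1) / len(measurements)
    return measurements, probabilities

# Example: fit an ECDF on a normalized sample of response times.
x_values, y_values = empirical_cdf([0.12, 0.34, 0.35, 0.47, 0.61, 0.98])
```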

After calculating the data, your sample should have a structure similar to the ECDF example we looked at earlier. The beauty of these statistical objects is that you can choose the size of your sample, giving you complete control to create a rolling baseline mechanism or static baseline, for example.

Typically, you will have only one benchmark sample to compare, but you will usually have several stable baseline samples available. As the person who creates the samples, you can choose which specific datasets go into the baseline and benchmark ECDFs, regardless of their sample size.

Now, with two ECDFs, one for the baseline and one for the benchmark, the real work begins. To analyze these, we need to understand two important distance statistics that we can calculate over an ECDF: the Wasserstein Distance and the Kolmogorov-Smirnov Distance, which we will briefly explore.

Roll up your sleeves and get the shovel, it is earth-moving time.

The Wasserstein Distance, also known as the Earth Mover’s distance, can be challenging to understand due to its formal complexity. However, it can be easier to grasp by using its physical real-world interpretation.

Imagine you have two piles of dirt, one larger than the other. Your boss asks you to make both piles the same size. The amount of dirt you need to move from one pile to the other to make them equal is the Wasserstein Distance. This distance measures the amount of "work" required to transform one pile into the other.

The Wasserstein distance provides a measure of the difference between two probability distributions in terms of the amount of "work" needed to transform one distribution into the other. The distance is calculated by determining the area between the two ECDFs.

The larger the area, the greater the difference between the two distributions and the larger the Wasserstein distance will be. It is a useful metric for performance analysis because it can be used to compare two different test results and provide a measure of how different they are.

[Image: the Wasserstein Distance shown graphically as the orange area between the red and blue lines.]

The formal equation to calculate the Wasserstein Distance is quite complex and it reads as follows:

[Image: the formal Wasserstein Distance equation. Straight out of The Lord of the Rings.]
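
For the one-dimensional samples we work with here, however, it reduces to something much friendlier: the area between the two cumulative distribution functions F and G,

W_1(F, G) = \int_{-\infty}^{\infty} \lvert F(x) - G(x) \rvert \, dx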

Here is example code to automate the calculation of the Wasserstein Distance:

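A minimal example using SciPy's built-in wasserstein_distance function could look like this (the two samples are illustrative; the full implementation is on my GitHub):

```python
from scipy.stats import wasserstein_distance

# Pre-processed baseline and benchmark samples (illustrative values).
baseline_sample = [0.12, 0.34, 0.35, 0.47, 0.61, 0.98]
benchmark_sample = [0.15, 0.36, 0.41, 0.52, 0.69, 0.99]

distance = wasserstein_distance(baseline_sample, benchmark_sample)
print(f"Wasserstein Distance: {distance:.3f}")
```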

The mathematical formula behind the Wasserstein Distance is not essential to understand; what matters is comprehending the result it produces. This number signifies the amount of effort needed to convert one performance test result set into another, taking the entire test result data set into account.

An increase in this number indicates a larger area between the two ECDFs, which can result from natural fluctuations in application, network, and database latency, or from a regression introduced by a defect. Because of those natural fluctuations, the Wasserstein distance will in practice never be exactly zero, but it should be as close to zero as possible to reflect a stable application.

The Wasserstein Distance therefore captures the overall stability of the application. However, if the difference between the two ECDFs is concentrated in one particular part of the graph, this measurement might not detect a significant change and could write it off as a normal swing in the application's latency behavior. To prevent this from happening, we can use the Kolmogorov-Smirnov Distance.

Busting out the Kolmogorov-Smirnov Distance to combat spikes.

The Kolmogorov-Smirnov Distance is the distance metric produced by the two-sample Kolmogorov-Smirnov hypothesis test. It represents the largest absolute difference between two cumulative distribution functions (CDFs). This is a valuable metric for performance engineers, as it quantifies the maximum distance between two tests.

The Kolmogorov-Smirnov distance helps determine whether the difference between two tests is within a normal range or has changed too much to be considered normal. This makes it an effective tool for identifying unusual differences between test results in the form of spikes.

[Image: two ECDFs, shown as a red and a blue line, with a Kolmogorov-Smirnov Distance of 0.207 at the point of maximum difference.]

In the graph above, the Kolmogorov-Smirnov Distance was calculated by comparing two distributions, represented by the red and blue lines. The resulting value of 0.207 is the maximum difference between the two cumulative distribution functions. The formula to determine the Kolmogorov-Smirnov Distance reads as follows:

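In standard notation, for two empirical distribution functions F_1 and F_2, it is the largest vertical gap between them:

D = \sup_x \lvert F_1(x) - F_2(x) \rvert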

It is possible to automate this calculation with code like the example below; the full version is available in my repository:

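A minimal example using SciPy's two-sample ks_2samp function is sketched below; the statistic attribute of the result is the Kolmogorov-Smirnov Distance we are after (the samples are illustrative):

```python
from scipy.stats import ks_2samp

# Pre-processed baseline and benchmark samples (illustrative values).
baseline_sample = [0.12, 0.34, 0.35, 0.47, 0.61, 0.98]
benchmark_sample = [0.15, 0.36, 0.41, 0.52, 0.69, 0.99]

result = ks_2samp(baseline_sample, benchmark_sample)
print(f"Kolmogorov-Smirnov Distance: {result.statistic:.3f}")
```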

A powerful wombo combo to start automating manual analysis.

Familiarity with the Wasserstein and Kolmogorov-Smirnov Distances as metrics for measuring differences between two sets of test results is a key factor in automating a comparison. I believe these two distance metrics complement each other, each picking up differences where the other is weaker. With this statistical understanding, we can develop an algorithm that processes these metrics and estimates the difference between the benchmark results and the established baseline.


The solution we are creating may seem to invite trendy buzzwords like AI or machine learning, but I would argue they do not apply. Instead, our focus is on using robust statistical methods to measure change and then feeding those calculations into a heuristic that can make sound decisions about the outcome. This is a more precise, statistically driven approach to determining whether the impact of a change was good or bad.

To be able to differentiate between good and bad, a series of critical values is necessary to categorize test results from poor to excellent. This requires a significant amount of performance test data, both from the application performing well and from it performing poorly. With this data, calculations and manual verification can be performed to establish what is considered good or bad performance. (If you have not been consistently storing your performance test results, I would suggest reading an earlier article I wrote that outlines my perspective and philosophy on this matter.)

Establishing or modifying these critical values may require experimenting with stable performance tests. I have established default thresholds within my project that you can use as a starting point. However, keep in mind that these thresholds are based on the data I had available, so it is advisable to verify the classifications you wish to apply to certain ranges or critical values.

This experiment involves gradually introducing changes to two stable performance tests, allowing you to identify values that may be considered excessive. For instance, if two identical samples are subjected to an experiment where the benchmark performance is artificially degraded while the baseline sample remains stable, the results will reflect this decline in performance. The animation below illustrates this experiment:

[Animation: the benchmark sample is gradually degraded while the baseline sample remains stable.]

In the animation, the top right corner above the legend displays the amount of distance introduced as a percentage, which is then distributed across the entire dataset. The bottom of the animation shows the increasing Wasserstein and Kolmogorov-Smirnov Distances for reference. It is entirely up to you to determine which distance values should be assigned to each category and how much regression you are willing to tolerate. The categorization is based on what you or your organisation would consider too much difference between the lines.
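
If you want to run this kind of experiment yourself, a rough sketch could look like the snippet below: an artificial slowdown is applied to a copy of a stable sample and both distances are recomputed at every step (the generated sample and the percentages are purely illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

# A generated stand-in for a stable, pre-processed response time sample.
rng = np.random.default_rng(seed=42)
baseline = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Gradually degrade a copy of the baseline and watch both distances grow.
for degradation in [0.00, 0.05, 0.10, 0.20, 0.40]:
    benchmark = baseline * (1 + degradation)
    wd = wasserstein_distance(baseline, benchmark)
    ks = ks_2samp(baseline, benchmark).statistic
    print(f"{degradation:.0%} degradation -> Wasserstein: {wd:.3f}, Kolmogorov-Smirnov: {ks:.3f}")
```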

Ranking our newly discovered distance statistics.

With the information obtained from the experiments and manual analysis, we can determine our own threshold for what is considered too much distance between two performance tests. This information can then be used to create a table of critical values, which can be categorized using a ranking system such as Japanese letter grades (S to F) or a score from 0 to 100, similar to what you would find in video games.

By using these critical values in a heuristic, we can estimate a letter grade and use it to determine the success or failure of a pipeline. A sample representation of these critical values and possible per-category actions is shown below:

[Table: example critical values for the Wasserstein and Kolmogorov-Smirnov Distances per rank (S to F), with a possible action per category.]
When reading this table, keep in mind that both statistics need to fall in the same category; if this is not the case, the lowest category is selected.
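
A heuristic that applies this rule can be as simple as the sketch below. The critical values here are illustrative placeholders only; my actual defaults live in my GitHub repository and should be tuned to your own data:

```python
# Illustrative critical values (upper bounds per rank); NOT real defaults.
RANK_LETTERS = ["S", "A", "B", "C", "D", "F"]
WASSERSTEIN_LIMITS = [0.030, 0.060, 0.100, 0.150, 0.250, float("inf")]
KS_LIMITS = [0.040, 0.080, 0.120, 0.180, 0.280, float("inf")]

def classify(value, limits):
    """Index of the first rank whose critical value the measured distance does not exceed."""
    return next(index for index, limit in enumerate(limits) if value <= limit)

def rank_test_result(wasserstein, kolmogorov_smirnov):
    """Both statistics must land in the same rank; otherwise the worst of the two wins."""
    worst = max(classify(wasserstein, WASSERSTEIN_LIMITS), classify(kolmogorov_smirnov, KS_LIMITS))
    return RANK_LETTERS[worst]

print(rank_test_result(wasserstein=0.045, kolmogorov_smirnov=0.150))  # -> "C"
```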

The ranks should be determined based on what is considered an acceptable or unacceptable level of performance degradation for the particular application. Which ranks may be released automatically is a business decision, and ultimately depends on the organization's specific risk tolerance regarding performance and stability.

You can view the animation below to understand how much the curves need to change to result in a change of rank according to our heuristic.

[Animation: how much the curves need to shift before the heuristic assigns a different rank.]

Ranks are an effective way of categorizing test results and making the performance impact understandable to others, but they can also create blind spots in automated analysis as they do permit some level of regression.

To overcome this issue, a stricter threshold such as a score is a better solution, as it allows for a more precise determination of the amount of regression between test results. An example of scoring test results can be seen in the animation below, where the benchmark test is gradually degraded while the identical baseline test remains stable.

[Animation: the benchmark test is gradually degraded against a stable, identical baseline while the score drops.]
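
One simple way to turn a measured distance into such a score, and not necessarily the exact formula I use in my repository, is to map it linearly onto a 0-100 scale between zero distance and the largest distance you are still willing to accept:

```python
def score(distance, worst_acceptable_distance):
    """Map a distance onto a 0-100 score: 100 means identical, 0 means at (or past) the limit."""
    ratio = min(distance / worst_acceptable_distance, 1.0)
    return round((1.0 - ratio) * 100)

# Example: with 0.25 as the worst acceptable Wasserstein distance, a measured 0.05 scores 80.
print(score(distance=0.05, worst_acceptable_distance=0.25))  # -> 80
```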

Summarizing and concluding this longer-than-usual article.

In conclusion, I want to express my gratitude for taking the time to read my article. I hope to have convinced you of the power of statistics and how it can be applied in performance engineering. By properly implementing this method, we can automate the repetitive task of performance test result analysis and make it more reliable, structured on solid statistics.

The concept behind this analysis is simple once you understand the statistics and have experimented with it. Additionally, this analysis can be combined with other checks such as detecting outliers, error rates, test duration, and throughput.

Automating performance test result analysis has a lot of potential and can free up performance engineers to focus on more pressing issues, while supporting DevOps teams in interpreting these complex results on their own.

If you found this article interesting, you can learn more by checking out my GitHub repository. If you have any questions or want to contribute, please don't hesitate to reach out to me, I am always eager to help expand this topic.
