How do we measure the Performance of a Microservice, Service, or Server?

Srinath Perera
6 min read · Jan 3, 2024

I have repeated the following thoughts too many times in the last few years and decided to write this up. If I have missed something, please add a comment.

Understanding Server Performance

Let’s first holistically understand the performance of a service.

Characteristic Performance Graphs of a Server

The above graphs capture the characteristic behavior of a server, which is measured through throughput and latency as seen by the server’s client.

  • Latency measures the end-to-end wait time as seen by the client. We determine latency by measuring the time between sending a request and receiving its response. Latency is measured from the client machine and includes the network overhead.
  • Throughput measures the number of messages a server processes during a specific interval (e.g., per second). Throughput is calculated by measuring the time taken to process a set of messages and then using the following equation.

Throughput = number of completed requests / time taken to complete those requests
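
For instance (hypothetical numbers), if ten concurrent clients together complete 10,000 requests in 20 seconds, the throughput is 10,000 / 20 = 500 requests per second.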

It is worth noting that these two values are often loosely related. However, we cannot directly derive one measurement from the other.

Performance is a dynamic measurement, unlike static server capacity measurements (e.g., CPU processing speed, memory size). Latency and throughput are strongly influenced by concurrency and work unit size.

Larger work unit sizes lead to higher latency and lower throughput. Concurrency is the number of work units (e.g., messages, business processes, transformations, or rules) being processed in parallel at a given time. Higher concurrency also leads to higher latency (wait time) and, beyond a point, lower throughput (units processed).

To visualize server performance across the range of possible workloads, we draw a graph of latency or throughput against concurrency or work unit size, as shown by the above graph.

As shown by the figure, a server has an initial range where throughput increases roughly linearly and latency either remains constant or grows linearly. This approximately linear relationship decays as concurrency increases further, and system performance degrades rapidly. Performance tuning attempts to modify the relationship between concurrency and throughput and/or latency, maintaining the linear relationship for as long as possible.

For more details about latency and throughput, read the following online resources:

  1. Understanding Latency versus Throughput
  2. Latency vs Throughput

Doing the Performance Test

Your goal in running a performance test is to draw a graph like the one above. To do that, you have to run the test multiple times at different concurrency levels and, for each run, measure latency and throughput.

Following are some of the common steps and a checklist.

Workload and Client Setup

  1. Each concurrent client simulates a different user. Each runs in a separate thread and issues a series of operations (requests) against the server.
  2. The first step is finding a workload. If there is a well-known benchmark for the particular server you are using, use that. If not, create a benchmark by simulating the actual user operations as closely as possible.
  3. Each message generated by the test must be different. Otherwise, caching might come into play and provide results that are too optimistic. The best method is to capture and replay an actual workload. If that is not possible, generate a randomized workload. Use a known data set whenever it makes sense.
  4. We measure latency and throughput from the client. For each test run, we need to calculate the following (a sketch of this calculation follows this list).
  5. The end-to-end time taken by each operation. Latency is the AVERAGE of all end-to-end latencies.
  6. For each client, we collect the test-start time, the test-end time, and the number of completed messages. Throughput is the SUM of the throughput measured at each client.
  7. Use nanosecond resolution to measure the time if possible.
  8. It is best to take readings for about 10,000 messages for each test run. For example, with concurrency 10, each client should send at least 1,000 messages. Even with many clients, each client should send at least 200 messages.
  9. Many tools can do the performance test. Examples are JMeter, LoadUI, javabench, and ab. Use them when applicable.
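
Below is a minimal sketch of such a test client, not a full benchmark tool. It assumes a hypothetical endpoint URL and uses Java 11's built-in HttpClient in place of a dedicated tool such as JMeter or ab. Each thread simulates one user, records nanosecond-resolution latencies, and the driver averages latency across all requests and sums throughput across clients.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LoadTestSketch {
    static final int CONCURRENCY = 10;            // number of simulated users
    static final int REQUESTS_PER_CLIENT = 1000;  // at least 1,000 messages per client
    static final String URL = "http://localhost:8080/service"; // hypothetical endpoint

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(CONCURRENCY);
        Callable<double[]> task = LoadTestSketch::runClient;
        List<Future<double[]>> futures = new ArrayList<>();
        for (int i = 0; i < CONCURRENCY; i++) {
            futures.add(pool.submit(task));
        }
        double latencySumNs = 0, throughputSum = 0;
        long completed = 0;
        for (Future<double[]> f : futures) {
            double[] r = f.get(); // {sum of latencies (ns), completed requests, elapsed time (ns)}
            latencySumNs += r[0];
            completed += (long) r[1];
            throughputSum += r[1] / (r[2] / 1e9); // per-client throughput in requests/second
        }
        pool.shutdown();
        // Latency is the AVERAGE over all requests; throughput is the SUM over all clients.
        System.out.printf("average latency = %.2f ms%n", latencySumNs / completed / 1e6);
        System.out.printf("throughput      = %.2f requests/s%n", throughputSum);
    }

    // One simulated user: sends requests sequentially and records nanosecond timings.
    static double[] runClient() throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        double latencySumNs = 0;
        long ok = 0;
        long start = System.nanoTime();
        for (int i = 0; i < REQUESTS_PER_CLIENT; i++) {
            // Vary each request (here via a query parameter) so that caching
            // does not make the results look better than they really are.
            HttpRequest req = HttpRequest.newBuilder(URI.create(URL + "?id=" + i)).GET().build();
            long t0 = System.nanoTime();
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            long t1 = System.nanoTime();
            if (resp.statusCode() == 200) { // count only successful operations
                latencySumNs += (t1 - t0);
                ok++;
            }
        }
        long elapsed = System.nanoTime() - start;
        return new double[] { latencySumNs, ok, elapsed };
    }
}
```

In a real test, a script would run this sketch at several concurrency levels and record the results for graphing.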

Experimental Setup

  1. You may need to tune the server for best performance with settings such as sufficient heap memory, open-file limits, etc.
  2. Do not run both the client and the server on the same machine (they interfere with each other, and the results are affected).
  3. You need at least a 1 Gbps network so that the network does not interfere with the results.
  4. Generally, you should not run more than 200 clients from the same machine. In some cases, you might need multiple machines to run the client. Watch the load average of the client machines; they should be less than 100.
  5. You have to note down and report the environment (memory, CPU, number of cores, operating system of each machine) with the results. Measuring the CPU usage and memory while the test is running is a good practice. If you are on a Linux-style machine, run the “watch cat /proc/loadavg” command to track the load average. CPU usage is an unreliable metric because it fluctuates rapidly; the load average, however, is a reliable metric (a small monitoring sketch follows this list).
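
A minimal monitoring sketch is shown below. It assumes a Linux host with /proc/loadavg and Java 11+, and simply records the load average every few seconds alongside basic environment details so they can be reported with the results.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class LoadMonitor {
    public static void main(String[] args) throws Exception {
        // Record the environment once, so it can be reported with the results.
        System.out.printf("cores=%d, os=%s %s%n",
                Runtime.getRuntime().availableProcessors(),
                System.getProperty("os.name"),
                System.getProperty("os.version"));
        while (true) { // run for the duration of the test; stop with Ctrl-C
            // /proc/loadavg holds the 1, 5, and 15 minute load averages.
            String loadAvg = Files.readString(Path.of("/proc/loadavg")).trim();
            System.out.println(System.currentTimeMillis() + " " + loadAvg);
            Thread.sleep(5_000); // sample every 5 seconds
        }
    }
}
```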

Running the test

  1. Make sure you restart the server between test runs.
  2. When you start the server, first send a few hundred requests before starting the real test to warm up the server.
  3. Automate as much as possible. Running one command should run the test, collect results, verify the results, and print summaries/graphs.
  4. Make sure nothing else is running in the machines at the same time.
  5. After the test run, check the logs and results to ensure that the measured operations were successful.

Verifying your results

  1. You must catch and print the errors on the server and the client. If there are too many errors, your results may be useless. Also, verifying the results on the client side is a good idea.
  2. Performance tests are often the first time you stress your system, and you will often run into errors. Leave time to fix them.
  3. You should use a profiler to detect any obvious performance bottlenecks before you actually run the tests.

Analyze your results and write them up

  1. Data cleanup (optional) is a common practice for removing outliers. The general method is to remove anything more than 3 standard deviations from the mean, or to remove the 1% of the data furthest from the mean (see the sketch after this list).
  2. Draw the graphs. Ensure you have a title, labels for both the X and Y axes with units, and a legend if you have more than one dataset in the same graph. (You can use Excel, OpenOffice, or GNU Plot.) The rule of thumb is that the reader should be able to understand the graph without reading the text.
  3. Optionally, draw 90% or 95% confidence intervals (error bars).
  4. Try to interpret what the results mean.
  • Generalize
  • Understand the trends and explain them.
  • Look for any odd results and explain them.
  • Make sure to have a conclusion.
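
As a rough illustration of the cleanup and confidence-interval steps above, the sketch below uses synthetic latency data (the numbers are made up), drops points more than 3 standard deviations from the mean, and prints an approximate 95% confidence interval for the mean.

```java
import java.util.Arrays;
import java.util.Random;

public class ResultAnalysis {
    public static void main(String[] args) {
        // Synthetic per-request latencies (ms) with one injected outlier; in a
        // real analysis these would come from the test client's measurements.
        Random rnd = new Random(42);
        double[] latenciesMs = new double[200];
        for (int i = 0; i < latenciesMs.length; i++) {
            latenciesMs[i] = 12.0 + rnd.nextGaussian();
        }
        latenciesMs[0] = 120.0;

        double mean = Arrays.stream(latenciesMs).average().orElse(0);
        double sd = Math.sqrt(Arrays.stream(latenciesMs)
                .map(x -> (x - mean) * (x - mean)).sum() / (latenciesMs.length - 1));

        // Data cleanup: drop anything more than 3 standard deviations from the mean.
        double[] cleaned = Arrays.stream(latenciesMs)
                .filter(x -> Math.abs(x - mean) <= 3 * sd).toArray();

        double cleanMean = Arrays.stream(cleaned).average().orElse(0);
        double cleanSd = Math.sqrt(Arrays.stream(cleaned)
                .map(x -> (x - cleanMean) * (x - cleanMean)).sum() / (cleaned.length - 1));

        // Approximate 95% confidence interval for the mean (normal approximation).
        double halfWidth = 1.96 * cleanSd / Math.sqrt(cleaned.length);
        System.out.printf("kept %d of %d points; mean = %.2f ms, 95%% CI = [%.2f, %.2f]%n",
                cleaned.length, latenciesMs.length,
                cleanMean, cleanMean - halfWidth, cleanMean + halfWidth);
    }
}
```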

Performance is a critical part of system design. Only by measuring performance and understanding how it changes as the architecture changes can we build an intuitive feel for a system’s performance, which is a hallmark of a great architect.

If you enjoyed this post, you might also like my new Book: Software Architecture and Decision-Making.

Get the Book, or find more details from the Blog.

Please note that as an Amazon Associate, I earn from qualifying purchases.
