What Every Test Engineer Needs to Know About Observability

Heemeng Foo
Published in The Startup
5 min read, Dec 15, 2020


My first encounter with the power of Observability

Way back in 2015 I was involved in the first-ever internet live stream of an NFL game (see [1]). I was in charge of quality for the front-end portions, i.e. the iOS, Android and Web experiences. We were testing almost every day and on weekends using real live events with significantly lower viewership. This went on for weeks.

During one such test, the engineers discovered a bug in the playback of our Web player. I naturally volunteered my team to support testing of the fix. To my surprise, the engineering team said they did not need it: they had instrumented the code so that the telemetry data beaconed back at the next live event would tell them whether the bug had been fixed.

I was taken aback by this. It then dawned on me that in the era of true CI/CD (CD being continuous deployment), the old approach of making sure there are zero bugs before release is not going to work and is in fact redundant.

What is Observability?

In “The complete guide to observability”, James Burns from Lightstep (see [2]) defines it as such:

It’s typically defined as a measure of how well you can infer the internal state of a system using only its outputs.

He further goes on to say:

For observability, the outputs that matter most are telemetry data — granular measurements of the system that allow Service Level Indicators (SLIs) such as latency, uptime, and error rate to be measured and explained.

Such telemetry data typically come in the form of:

  1. Logs
  2. Metrics
  3. Traces
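
To make the three forms a bit more concrete, here is a minimal sketch of a single request handler that emits all of them. Everything in it is an assumption for illustration (the handler, the metric name, the span fields); real tools have their own agents and formats for this.

```python
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("demo")

metrics = Counter()  # metric: a number aggregated over many requests
spans = []           # trace: timed steps that share one trace id


def handle_request(path):
    """Hypothetical handler that emits a log line, a metric and a span."""
    trace_id = uuid.uuid4().hex
    started = time.perf_counter()

    # Log: one discrete, human-readable record of what happened.
    log.info("handling %s trace_id=%s", path, trace_id)

    # Metric: a counter that a dashboard could chart over time.
    metrics["requests_total"] += 1

    # Trace: how long this step took, tagged with the trace id.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_ms": (time.perf_counter() - started) * 1000,
    })
    return trace_id


handle_request("/video/42")
handle_request("/video/43")
print(metrics["requests_total"], len(spans))
```

The trace id appearing in both the log line and the span is what lets you jump from one signal to the other when troubleshooting.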

In the NFL event described above, the Web player team had gone one step further: they instrumented the code so that signals for the code paths executed and for the error triggers were also transmitted back. By correlating the two, they could tell whether the bug was still manifesting and hence whether the fix was successful.
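
I don't have the actual player code, so the following is only a sketch of what this kind of instrumentation could look like. The beacon helper, the event names and the playback function are all hypothetical; a real player would send these events over the network rather than keep them in a list.

```python
import json
import time

# In-memory stand-in for a telemetry backend (assumption for this sketch).
EVENTS = []


def beacon(name, **attributes):
    """Record one telemetry event. Hypothetical helper, not a vendor API."""
    EVENTS.append({"name": name, "timestamp": time.time(), **attributes})


def start_playback(stream_url, build):
    """Hypothetical playback entry point with both kinds of signal."""
    # Code-path signal: tells us this path was actually executed.
    beacon("playback_start_attempted", build=build)
    try:
        if not stream_url.startswith("https://"):
            raise ValueError("unsupported stream URL")
        beacon("playback_started", build=build)
        return True
    except ValueError as exc:
        # Error trigger: tells us the failure actually happened.
        beacon("playback_error", build=build, reason=str(exc))
        return False


start_playback("https://example.com/live.m3u8", build="1.0.1")
start_playback("rtmp://example.com/live", build="1.0.1")
print(json.dumps([event["name"] for event in EVENTS]))
```

The point of sending both signals is that an absence of errors only means something if you can also see that the code path was exercised.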

Why is Observability critical to Test Engineering?

All test engineers know there is no way to test every possible code path and data combination. Mostly we (a) test that the product meets the specifications and (b) make a best-effort attempt to cover the edge cases. In fact there is a well-known joke about test engineers that goes like this (taken from Reddit):

A software QA engineer walks into a bar. He orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd. First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.

The problem here is that out in the wild (or rather out in production), users will use the product in the most unexpected ways. Also the number of possible data combinations can be staggering.

Add to this that, in order to stay relevant, businesses always want products shipped yesterday, so you don’t have a lot of time to test.

The only way to keep up is to instrument the code so that (a) you know which code paths have been executed and (b) data is collected about whether individual aspects of the product are functioning correctly. Such instrumentation has been around for quite a while, especially since the introduction of A/B testing. Couple this with CI/CD and the ability to deploy code to production multiple times a day and you essentially have a hypothesis testing engine.

In other words:

  1. Use the signals sent by the app to tell you if the bug is manifesting
  2. Make a hypothesis about which code paths are causing the problem and put in a fix
  3. If no signal indicating the bug shows up after a period of time, you can assume that the bug is fixed
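
The steps above can be sketched as a simple comparison of the error signal before and after the fix. The event names, build numbers and counts below are made up for illustration, and the function is just one way of doing the check, not a feature of any particular tool.

```python
from collections import Counter


def error_rate_by_build(events, attempt_event, error_event):
    """Return errors per attempt for each build that saw at least one attempt."""
    attempts = Counter()
    errors = Counter()
    for event in events:
        if event["name"] == attempt_event:
            attempts[event["build"]] += 1
        elif event["name"] == error_event:
            errors[event["build"]] += 1
    # A Counter returns 0 for builds that never reported an error.
    return {build: errors[build] / count for build, count in attempts.items()}


# Made-up sample: build 1.0.0 has the bug, build 1.0.1 carries the fix.
sample_events = (
    [{"name": "playback_start_attempted", "build": "1.0.0"}] * 4
    + [{"name": "playback_error", "build": "1.0.0"}] * 2
    + [{"name": "playback_start_attempted", "build": "1.0.1"}] * 5
)

rates = error_rate_by_build(
    sample_events, "playback_start_attempted", "playback_error"
)
print(rates)
```

Dividing by the number of attempts matters: a build with zero errors and zero attempts tells you nothing, whereas zero errors over many attempts supports the hypothesis that the fix worked.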

This significantly changes the dynamic of testing. The role of the Test Engineer becomes not only identifying what can be tested in the sandbox of the test environment but also deciding what needs to be instrumented so that we know exactly what is going on in production.

What do Test Engineers need to know about Observability?

The Tools

The common tools in this space are (just to name a few):

  1. Lightstep
  2. Splunk
  3. Datadog
  4. ELK, i.e. Elasticsearch, Logstash, Kibana
  5. AppDynamics
  6. New Relic

It used to be that you had to spend a lot of time installing and configuring these tools, but nowadays you can opt for the cloud-hosted version and get familiar with them fairly easily. They usually have a free tier or trial period, so the barrier to entry is pretty low.

2 Key Skills: Instrumentation and Query

For most of the tools, e.g. New Relic and Lightstep, you just need to install their agent or library and voilà, you have data streaming in, albeit only basic information such as resource utilization (if you’re dealing with containers). However, if you want to perform the kind of testing described above, you will need to know how to instrument custom events. This is usually described in the documentation and varies from tool to tool. One thing is common, though: each event carries a whole bunch of information, e.g. IP and timestamp. These attributes are essential for troubleshooting the issue.
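
Since the exact API differs per tool, here is only a tool-neutral sketch of what a custom event with its accompanying attributes might look like before it is handed to an agent. The function name, the field names and the sample values are my own assumptions, not any vendor's schema.

```python
import json
import platform
import time
import uuid


def build_custom_event(event_type, **attributes):
    """Assemble a custom event. Hypothetical shape, not a vendor schema."""
    event = {
        "eventType": event_type,
        "eventId": str(uuid.uuid4()),
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
        "host": platform.node(),
    }
    # Caller-supplied attributes are added last, so they win on a name clash.
    event.update(attributes)
    return event


event = build_custom_event(
    "CheckoutFailed",
    client_ip="203.0.113.7",  # address from a documentation-only range
    build="1.0.1",
    error_code="CARD_DECLINED",
)
print(json.dumps(event, indent=2))
```

The more context you attach at the point where the event fires, the less guesswork there is later when you are slicing the data by build, client or error code.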

Next comes the query language. Again, each tool has its own, e.g. New Relic has NRQL (New Relic Query Language) and Splunk has SPL (Search Processing Language). The good news is that most (I’d say all but I’m being conservative here) are based on SQL, so the learning curve is not that steep.
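
To give a feel for the kind of question you end up asking, here is a sketch in plain SQL run through Python's built-in sqlite3 module. It is not NRQL or SPL, and the table, columns and rows are invented for the example, but the shape (filter, group, count, sort) carries over.

```python
import sqlite3

# In-memory database standing in for the event store (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, build TEXT, browser TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("playback_error", "1.0.0", "Firefox"),
        ("playback_error", "1.0.0", "Chrome"),
        ("playback_error", "1.0.0", "Firefox"),
        ("playback_started", "1.0.1", "Firefox"),
    ],
)

# Which browsers are reporting the error, and how often?
rows = conn.execute(
    """
    SELECT browser, COUNT(*) AS errors
    FROM events
    WHERE name = 'playback_error'
    GROUP BY browser
    ORDER BY errors DESC
    """
).fetchall()
print(rows)
conn.close()
```

If you can read this query, the tool-specific dialects should mostly be a matter of looking up the keywords.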

Experimentation

There is no substitute for actually working with the code. If you’re a decent test engineer you will have a simple test app (web, mobile or microservice) that you’ve built in your spare time. First, instrument the code of the test app using the API/agent and watch the data being beaconed back to the tool (e.g. Splunk, New Relic). Next, get familiar with the query language so you can obtain the data you are looking for. Lastly, build some simple dashboards to monitor the incoming data.

What’s next?

The final part is to agree with the engineering team on the structure, process, coding and naming conventions for the events that will be instrumented in the product. If these already exist, follow them. Don’t annoy your engineering team. Help them understand what you are trying to achieve with this. Quality is a team sport; we’re all in this together.

References

[1] The NFL’s first live-streamed game on Yahoo attracted over 15 million viewers, Sarah Perez, TechCrunch, Oct 2015, https://techcrunch.com/2015/10/26/the-nfls-first-live-streamed-game-on-yahoo-attracted-over-15-million-viewers/

[2] The complete guide to observability, James Burns, Lightstep, https://go.lightstep.com/register-complete-guide-to-observability.html
