What every developer should know about fault scenarios

Lucas Silveira
13 min read · Jul 28, 2023
Painting by Ivan Konstantinovich Aivazovsky

Dealing with application failures is a common part of a software developer’s life, and although expected, nobody likes going through these situations. More often than not, failures come with frustrated or even angry customers, support team burnout, and pressure on on-call developers.

These circumstances usually occur because of failure scenarios the technical team has not mapped. And let me say it upfront: it is impossible to map every situation in which an application can fail.

How then can we mitigate the damage caused by defects in production?

To answer this question, we must understand some important principles of software development and how they help us to avoid a catastrophic escalation of an application failure.

First of all, it is important to highlight that the ideas discussed here belong to the context of development. Therefore, solutions and practices more related to Ops, QA and Infosec are outside the scope of this article, although we may reference them throughout the text.

In this article you will understand more about:

  • Errors, defects and failures
  • Murphy’s Law and computer networks
  • What is and how to deal with temporal coupling
  • Working with atomic transactions
  • The importance of automated testing
  • Acceptance criteria and the guarantee of reliability
  • Observability and best practices in creating alerts

Errors, defects and failures

In the area of software testing, it’s common to segment the malfunction of an application into three complementary categories: errors, defects and failures. Understanding how these concepts relate to each other is extremely important when it comes to triaging, mitigating and resolving production failures. So let’s see what each one means:

Error

Error is a human action that produces an incorrect result. For example, when reading a functional requirement, a programmer may not pay attention to a specific detail and produce incorrect code. Here, the common meaning of the word, found in dictionaries, applies.

Defect

A defect, also known as a bug, is the result of an error. For example, poorly written code is a defect that, when executed, can cause failures.

Failure

Failure, as already mentioned, is the effect caused by a defect. An application that does not respond correctly to an HTTP request, for example, is a failure.

When we are faced with problems in a production application, the first contact we have is with the failure. From there, we start an investigation to find the cause of the failure, that is, the defect. At first, we should not waste energy on discovering the error that caused the defect. The priority is mitigating the failure, whether through a rollback or a hotfix.

Later, when the application has been normalized, it is important to identify the error that caused the defect and, consequently, the failure. This can be done through post-mortem meetings. It is important not to assign blame during these meetings, and to foster a safe environment where people can speak openly without being judged.

“When incident and accident responses are seen as unfair, it can derail safety investigations, promoting fear rather than attention in people doing critical work for the product, making organizations more bureaucratic rather than more careful, and cultivating professional secrecy, evasion and self-protection.” — Gene Kim et al., The DevOps Handbook

Understanding the role that each of these malfunction categories played during a failure helps us keep improving the application, whether through human processes to avoid errors, automated software processes to avoid defects, or automated deployment processes to mitigate failures.

We must remember that the only sustainable competitive advantage is an organization’s ability to learn faster than the competition. And this is only possible when we are transparent and humble to assume our mistakes.

Murphy’s Law and computer networks

Many of you may have heard about Murphy’s Law, a maxim that has been around for some time and expresses a certain negativity by stating that “if anything can go wrong, it will.”

Although it is not an exact, scientific law, we can take it seriously, as it helps us prepare for failure scenarios.

And how can we know if something can go wrong? This question can be answered simply by observing how much the result we expect depends on variables beyond our control. And when it comes to computer networks, the variables are countless.

In modern software development, functional requirements often require applications to integrate with third-party systems through APIs. This integration is only possible thanks to the worldwide computer network, also known as the Internet.

Through this network and the client-server model, we can make requests to a server and obtain the respective responses during an application transaction. This allows us to expand the functionality of our software and thus provide our users with a greater value proposition.

Client-Server model — Tanenbaum [CN]

However, this mechanism can be a trap for an inexperienced developer who, not fully understanding how a network connection works, may end up making the application heavily dependent on I/O and consequently vulnerable to all sorts of network failures and latencies.

In fact, network message passing is not as cheap as inter-process message passing. In addition, messages are transported over a physical medium that is subject to all kinds of defects, and, applying Murphy’s fateful Law, we can expect a failure to happen (a cable breaks, a packet is lost, the server becomes unavailable). This is more evident in network requests that use the TCP protocol, where there is a need for a handshake and multiple packet trips — you can obtain more information in this article published on the Microsoft website.

Of course, we must be cautious when deciding to make network calls in our application. Moreover, this concern must be taken into account when modeling the architecture of our system. Otherwise, we’ll end up with an application choked by temporal couplings and prone to constant failures.
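As a concrete illustration, here is a minimal Python sketch of a defensive network call using the requests library; the endpoint and function name are hypothetical:

```python
import requests

def fetch_exchange_rates(base_url: str) -> dict | None:
    """Fetch rates defensively; return None when the network lets us down."""
    try:
        # Never call the network without a timeout: waiting forever turns
        # a slow dependency into an outage. (connect, read) in seconds.
        response = requests.get(f"{base_url}/rates", timeout=(3.05, 10))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # The downstream service was too slow; degrade gracefully.
        return None
    except requests.RequestException:
        # DNS failure, connection reset, HTTP 5xx... Murphy's Law at work.
        return None
```

The caller must then decide what a sensible fallback is: a cached value, a default, or an explicit error to the user.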

What is and how to deal with temporal coupling

Sequence Diagram

Temporal coupling occurs when an upstream service needs to synchronously communicate with a downstream service during an operation. Therefore, both services need to be available and healthy at the same instant of time for the operation to be successful.

Temporal couplings are not something we should avoid at all costs. Most of the time they are needed so that we can fulfill a usage requirement. However, creating unnecessary couplings can end up compromising application uptime. That is, the more coupling there is, the more latency our system will have and the greater the chance of network failures.

Another point that needs to be highlighted regarding these synchronous communications is the fact that, if a failure occurs, the entire transaction is interrupted, forcing the client to make a new attempt. To make matters worse, if our system does not have idempotency mechanisms, a second transaction may fail due to conflicts between the request payload and the application’s internal state.

If our application consists of a distributed system, we must be even more careful with temporal couplings, since, in microservices architecture, it is common to have a high dependency and integration between network components.

Given all these points of attention, how can we measure the degree of coupling and possible solutions to these problems?

A relatively easy way to identify temporal couplings is to draw a UML sequence diagram of our transactional flows. Through this design, we get a clear view of the number of dependencies between our services during the execution of an operation. The diagram also helps us to discover the most critical points, so that we can dedicate efforts to the highest priority parts of the application.

To reduce the amount of coupling, on the other hand, we may need to reshape our architecture to support a new style of integration. This can lead us to the use of asynchronous communication which, in addition to increasing fault tolerance, also improves the resilience of our services.
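To illustrate the asynchronous style, here is a minimal Python sketch that uses a standard-library queue and a worker thread as a stand-in for a real message broker (RabbitMQ, SQS, Kafka and the like); the order/notification scenario is hypothetical:

```python
import queue
import threading

events: queue.Queue = queue.Queue()

def place_order(order: dict) -> None:
    # Upstream: enqueue the event and return immediately -- no temporal
    # coupling to the notification service.
    events.put({"type": "order_placed", "order": order})

def notification_worker() -> None:
    # Downstream: consumes events at its own pace and could retry on
    # failure without ever failing the user's original request.
    while True:
        event = events.get()
        try:
            print(f"sending e-mail for order {event['order']['id']}")
        finally:
            events.task_done()

threading.Thread(target=notification_worker, daemon=True).start()
place_order({"id": 42, "item": "book"})
events.join()  # demo only: wait for the worker to drain the queue
```

With a real broker, the queue also survives a downstream outage: the messages simply wait until the consumer is healthy again.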

However, there is an extremely important concept that must be very clear in our minds before we consider a definitive solution: atomic transactions.

Working with atomic transactions

We must ensure that our services continue to function when failures occur, potentially across our entire system, and preferably without crisis or even manual intervention.

Achieving this level of resilience involves analyzing not only infrastructure technologies, but also conceptual aspects that dictate how our system works.

We saw in the previous section that we can mitigate the impacts of temporal coupling through asynchronous communication. This style of inter-service communication makes it possible to model atomic transactions. But after all, what are atomic transactions?

If you’re already familiar with relational databases, you’ve probably heard about the acronym ACID, that is, Atomicity, Consistency, Isolation and Durability. This is a standard that almost all popular DBMS follow in order to ensure reliable data persistence. However, what interests us here is the first concept: Atomicity.

In the ACID standard, atomicity proposes that each database transaction is treated as a single unit, regardless of the number of operations that may occur internally (disk reads or writes) and, therefore, can never end in a partial state. That is, the transaction can only have two types of results: success or failure — there is no “sort of…” in an atomic transaction.

Transaction process — Oasis Open
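To make atomicity concrete, here is a minimal sketch using Python’s built-in sqlite3 module; a transfer between two hypothetical accounts either fully commits or fully rolls back:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
con.commit()

try:
    # `with con:` opens a transaction: commit on success, rollback on
    # any exception -- the credit and the debit succeed or fail together.
    with con:
        con.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # Alice only has 100, so this debit violates the CHECK constraint:
        con.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError:
    pass

# Bob's credit was rolled back along with the failed debit: no partial state.
print(con.execute("SELECT balance FROM accounts WHERE name = 'bob'").fetchone())  # (0,)
```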

As in the ACID model, in the context of communication between services, an atomic transaction does not necessarily have only one operation. We can have more than one operation and still have an atomic transaction, as long as these operations agree with their result, that is, either they all end successfully, or they all fail (and revert). However, the more operations we have in a transaction, the harder it is to achieve atomicity.

When I say “operation” in the previous paragraph, I’m referring to requests to external resources, whether third-party applications or databases. Each database read/write, each network call and each file read/write counts as an operation within the transaction. We must therefore strive to reduce the number of these operations as much as possible so that we can achieve atomicity.

However, an important question arises: how is it possible to obtain an atomic transaction with more than one operation?

The answer is simple: through idempotency!

Idempotency is the property that performing the same operation multiple times produces the same result as performing it once, without additional side effects. We can achieve this through some relatively simple mechanisms, such as the ones below (a code sketch follows the list):

  • Persist the last response of an operation into a cache layer;
  • Handle errors to bypass issues related to state conflicts;
  • Store transaction state to resume where it left off.
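As a sketch of the first mechanism, here is a hypothetical payment operation in Python that caches the last response under a client-supplied idempotency key, so that a retried request cannot charge the customer twice:

```python
# In-memory cache of past responses, keyed by idempotency key.
# In production this would live in a shared store such as Redis.
responses: dict[str, dict] = {}

def charge(idempotency_key: str, amount: int) -> dict:
    if idempotency_key in responses:
        # The operation already ran: same result, no new side effects.
        return responses[idempotency_key]
    receipt = {"status": "charged", "amount": amount}  # the real side effect
    responses[idempotency_key] = receipt
    return receipt

first = charge("order-42", 100)
retry = charge("order-42", 100)  # network retry after a lost response
assert first == retry            # no double charge
```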

There is, therefore, an evident intersection between the concepts of atomic transactions and temporal coupling. Reducing the amount of temporal coupling in our application’s operations allows us to achieve atomic transactions, which in turn make our application more reliable and resilient.

The importance of automated testing

Automated testing is a subject that has already generated a lot of content and endless discussions in the area of software development. Therefore, I do not intend to go into detail about good testing practices. My intention, in this article, is just to remind you about the importance of testing our applications, more precisely, failure scenarios.

Writing good software tests can come down to understanding how our application can break, the possible harmful inputs, and what we hope to see as a result. In other words, this might require that we first define our failure modes and then test to ensure those failure modes work as designed.
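For example, here is a minimal pytest-style sketch; get_profile and its cache fallback are hypothetical, but they show a designed failure mode being tested directly:

```python
def get_profile(user_id: int, fetch, cache: dict) -> dict:
    """Return a user profile, falling back to the cache on timeout."""
    try:
        return fetch(user_id)
    except TimeoutError:
        # Designed failure mode: serve stale data rather than crash.
        return cache[user_id]

def test_get_profile_falls_back_to_cache_on_timeout():
    def flaky_fetch(user_id: int) -> dict:
        raise TimeoutError("downstream took too long")

    cache = {7: {"id": 7, "name": "cached-alice"}}
    assert get_profile(7, flaky_fetch, cache) == {"id": 7, "name": "cached-alice"}
```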

However, to obtain a higher level of reliability, we must go beyond unit and integration tests. One way we do this is by injecting failures into our production environment and rehearsing large-scale failures so that we are confident that we can recover from accidents when they occur, preferably without impacting our customers.

“A service isn’t really tested until we screw it up in production” — Jesse Robbins, Chef co-founder.

This large-scale automated testing methodology, also known as Chaos Testing, has been practiced for some time at Netflix and has become an extremely important practice for the organization’s technical team. Netflix’s automated tests evolved to the point of giving rise to an important open-source tool called Chaos Monkey.

It’s natural to see resistance to the idea of injecting faults into our production applications. However, when we do this in a controlled way and with the support of reliable tools, we deliberately build confidence in our ability to react to a real failure scenario.

“Exercising repeated and regular failure, even in the persistence [database] layer, should be part of every organization’s resiliency planning.” — Gene Kim et al., The DevOps Handbook

Acceptance criteria and the guarantee of reliability

Now I would like to touch on a subject that, although it goes beyond the responsibilities of a software developer, is very important for the reliable delivery of an application or feature.

Usual software testing is not the only way to ensure that a feature behaves as expected. There is another layer of assurance that can and should be explored: Acceptance Criteria.

Acceptance criteria are documents normally produced by the business team (for example, the PO in the Scrum methodology) and are used as a reliability assurance mechanism. They are commonly used in complementary tests before the solution is delivered to production. We can get even more out of these documents by also using them as inputs for our unit and integration tests.

It is important to know that good acceptance criteria have sufficiently clear and contextual information. Thus, for a document to be useful, it should briefly describe the use case, the action that will be taken, and the expected result. An excellent suggestion is to use the Given/When/Then model to write the criteria, see an example:

Given that I am on the product page

When I click the “add to favorites” button

Then the favorites container will be updated to reflect the new quantity
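To show how a criterion like this can feed our automated tests, here is a minimal Python sketch; ProductPage is a hypothetical stand-in for a real page object or UI driver, and the point is the one-to-one mapping of Given/When/Then onto arrange/act/assert:

```python
class ProductPage:
    """Hypothetical stand-in for a real page object or UI driver."""

    def __init__(self, product_id: int) -> None:
        self.product_id = product_id
        self._favorites = 0

    def favorites_count(self) -> int:
        return self._favorites

    def click_add_to_favorites(self) -> None:
        self._favorites += 1

def test_adding_to_favorites_updates_the_counter():
    # Given that I am on the product page
    page = ProductPage(product_id=123)
    before = page.favorites_count()

    # When I click the "add to favorites" button
    page.click_add_to_favorites()

    # Then the favorites container reflects the new quantity
    assert page.favorites_count() == before + 1
```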

In addition, the business team can also meet with the technical team to review the proposed document and how they will carry out the tests. Thus, the two teams can clarify any doubts and share the same expectations.

Adding acceptance criteria to our software development tooling raises the level of reliability and security of our deliveries, helps keep our system always in a deployable state and consequently increases the value perceived by our customers.

Observability and best practices in creating alerts

We come to our last topic and, although it was left for the end, it is no less important.

We must be able to respond quickly to production failures when they occur, otherwise irreversible damage can be caused to our customers. Furthermore, this is one of our responsibilities as a technical team.

Therefore, rapid incident response involves the use of good telemetry tools. They should give us enough information to confirm that our services are working correctly in production and, once an anomaly is detected, enable us to quickly determine what is going wrong and make informed decisions about the best way to fix it, preferably well before customers are impacted. In other words, telemetry is what gives us our best understanding of reality and lets us detect when that understanding is incorrect.

To do so, we must carefully choose a good APM tool, giving preference to those that include not only real-time application metrics (such as CPU, memory, throughput, and database and network latency) but also let us manage a reliable notification system. That is why it is vitally important to have the alert mechanism correctly configured, so that we learn of an incident as soon as it occurs or, if possible, before it occurs. This may require knowledge of historical application traffic volumes so that we can set efficient thresholds for triggering alerts.
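As a simple illustration of deriving a threshold from historical traffic, the Python sketch below applies a mean-plus-three-standard-deviations heuristic to hypothetical hourly request counts; real systems would use richer baselines (seasonality, percentiles):

```python
import statistics

# Illustrative history of hourly request counts.
hourly_requests = [1200, 1350, 1280, 1410, 1190, 1330, 1275]

mean = statistics.mean(hourly_requests)
stdev = statistics.stdev(hourly_requests)
threshold = mean + 3 * stdev  # alert only on genuinely anomalous spikes

print(f"alert if hourly requests exceed {threshold:.0f}")
```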

There is yet another important monitoring instrument: application logs.

In some failure scenarios, infrastructure metrics might not be sufficient to determine the defect that caused the failure. Logs can make a huge difference in speeding up and making the debugging process more effective. So knowing where and how to log chunks of data during an application transaction can be a game changer.

However, care must be taken when deciding which logging mechanism to use. Logs should not become a bottleneck in our application. It is therefore important that log processing is separated from the main thread of our system, so that, if we use network calls to persist logs in an external database, our main transactions are not impacted by latencies or network failures. A good logging strategy most often involves capturing the data emitted to the process’s standard output (the stdout file descriptor) from a separate thread or process.

Log Aggregation — Newman [BM]
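One way to keep logging off the main thread in Python, as a minimal sketch, is the standard library’s QueueHandler/QueueListener pair: the application thread only enqueues log records, while a background listener thread performs the possibly slow I/O:

```python
import logging
import logging.handlers
import queue

log_queue: queue.Queue = queue.Queue()

# The slow handler (imagine a network log shipper) runs in the listener
# thread; StreamHandler stands in for it here.
slow_handler = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

logger = logging.getLogger("app")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

logger.info("order 42 placed")  # returns immediately; no I/O on this thread

listener.stop()  # flush and stop the background thread on shutdown
```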

Below, I share some extremely relevant key questions taken from Susan J Fowler’s book Production-Ready Microservices:

Alerts

  • Is there an alert for all key metrics?
  • Are all alerts defined by appropriate thresholds?
  • Are alert thresholds properly configured to trigger an alert before an outage occurs?
  • Are all alerts actionable (i.e., do they require a response)?
  • Are there step-by-step triage, mitigation and resolution instructions for each alert in the on-call runbook?

Logging

  • What information does this application need to log?
  • Does this app log all important requests?
  • Does logging reflect the state of the application at any given time?
  • Is this logging solution scalable and cost-effective?

Conclusion

In this article, we looked at some important principles for dealing with software failures. It is vitally important to know the main failure scenarios before delivering a solution to production, so that we are better prepared when an incident occurs.

In addition, the technical team must always be willing to learn from its mistakes, because failures will always occur. History shows how learning organizations treat failures, accidents and mistakes: as learning opportunities, not something to be punished.

I hope the article was helpful to you. If you liked it, be sure to leave some “claps” and share it with your friends.

See you soon ;)

