How did we find a hidden bug?

Matvei Koniaev
Jul 18, 2023

This is not a success story; it’s about how we messed up during refactoring, discovered a major bug, and what we learned from it.

Have you ever unexpectedly stumbled upon something important? We certainly have. This article recounts how a few silly mistakes led us to a bug, and how surprised we were to realize the system had been behaving that way for a long time.

Consider this a reminder or a trigger to check your code and ensure you don’t have similar issues.

Decisions and Consequences

One day, we made the decision to refactor some old code and replace it with new, beautiful, maintainable, and expandable code. As you may know, such decisions can bring problems for teams, businesses, and customers. However, we were determined to succeed.

One of the problems we encountered was the large number of changes in git, which made them difficult to review and made mistakes hard to spot. But this is a common issue with a known solution: we decided to make small merge requests and refactor step by step, with reviews and tests along the way. What could possibly go wrong? Well, as it turned out, everything.

After merging all the changes, on a Friday night, we received a message from a team that had integrated with our application. They informed us that the API did not follow the contract stated in the documentation.

Reflecting on our problems

The first mistake we made was changing the contract without considering the implications. The review process was not thorough enough, which allowed the problem to slip through to the staging layer. Once we realized the issue, we promptly reverted the changes and began investigating why we hadn't anticipated these problems before merging. We also asked why our external tests had not caught them.

The answer was surprisingly simple: our old application lacked adequate contract tests. While we had numerous tests for various aspects of the application, we had none for the contract. This was our second mistake: refactoring without first making sure that high-quality contract tests were in place.
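To make the idea of a contract test concrete, here is a minimal sketch of what one could look like. It is not the test suite we ended up with: the `items` and `name` fields, the `EXPECTED_CONTRACT` constant, and the `matches_contract` helper are all hypothetical names chosen for illustration, and the payloads are hard-coded where a real suite would call the service.

```python
import json

# Hypothetical contract: the documented shape of one API response.
# Field names are invented for this sketch, not taken from a real API.
EXPECTED_CONTRACT = {"items": [{"name": str}]}


def matches_contract(value, contract):
    """Return True if `value` has the shape described by `contract`.

    A dict contract requires exactly the same keys with matching values,
    a one-element list contract describes every item of a list, and a
    type contract requires an instance of that type.
    """
    if isinstance(contract, dict):
        return (
            isinstance(value, dict)
            and value.keys() == contract.keys()
            and all(matches_contract(value[k], contract[k]) for k in contract)
        )
    if isinstance(contract, list):
        return isinstance(value, list) and all(
            matches_contract(item, contract[0]) for item in value
        )
    return isinstance(value, contract)


def test_response_keeps_documented_shape():
    # In a real suite this payload would come from the running service.
    payload = json.loads('{"items": [{"name": "some-value"}]}')
    assert matches_contract(payload, EXPECTED_CONTRACT)


def test_object_instead_of_list_is_rejected():
    # The kind of silent shape change a refactoring can introduce.
    payload = json.loads('{"items": {"name": "some-value"}}')
    assert not matches_contract(payload, EXPECTED_CONTRACT)
```

Even a check this small fails as soon as a list turns into an object or a field disappears, which is the kind of change that slipped past our reviews.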

The third mistake involved our external tests. Unfortunately, we had not thoroughly examined these tests beforehand, and as a result, there were no tests specifically focused on the contract.

Three common mistakes occurred consecutively, mistakes that we should have anticipated.

The Devil is in the Details

A few days later, we encountered another problem with the integration. This issue was less obvious and stemmed from incorrectly utilizing a library.

Let's look at some example data. The external service expected data in this format:

{
  "key": {
    "another-key": "some-value"
  }
}

The actual format of the data:

{
  "key": [
    {
      "another-key": "some-value"
    }
  ]
}

The exact problem we encountered was related to an incorrect approach in using the “xmltodict” library to convert XML to JSON. While our application provided the correct interface, the behavior of the library introduced some peculiarities.

Specifically, when an XML element that represents an array contains only one child, the library converts it into a single object rather than a list. With two or more children, it returns a list of objects. This inconsistency caused unexpected issues in our integration and led to further errors in our application's functionality.

It’s crucial to understand and account for such peculiarities when working with third-party libraries, as they can inadvertently change the expected behavior of our code. Below are some examples of the findings we encountered.

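>>> import json
>>> import xmltodict
>>> # One <many> child: the value is a plain string
>>> doc = xmltodict.parse("""
... <xml-document>
...   <another-key>
...     <many>elements</many>
...   </another-key>
... </xml-document>
... """)
>>> print(json.dumps(doc, indent=2))
{
  "xml-document": {
    "another-key": {
      "many": "elements"
    }
  }
}
>>> # Two <many> children: the value becomes a list
>>> doc = xmltodict.parse("""
... <xml-document>
...   <another-key>
...     <many>elements</many>
...     <many>elements</many>
...   </another-key>
... </xml-document>
... """)
>>> print(json.dumps(doc, indent=2))
{
  "xml-document": {
    "another-key": {
      "many": [
        "elements",
        "elements"
      ]
    }
  }
}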
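One way to guard against this behavior is to normalize such values before they leave the application. The sketch below is an illustration, not the fix we shipped: `as_list` is a hypothetical helper, and the two dictionaries are written by hand to mirror the parse results shown above so that the example runs without the library installed.

```python
def as_list(value):
    """Return `value` as a list, whether it is missing, a lone item, or a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hand-written dicts shaped like the two parse results shown above.
single = {"xml-document": {"another-key": {"many": "elements"}}}
multiple = {"xml-document": {"another-key": {"many": ["elements", "elements"]}}}

for parsed in (single, multiple):
    # Both shapes now reach the caller as a list.
    items = as_list(parsed["xml-document"]["another-key"].get("many"))
    print(items)
```

The xmltodict README also describes a `force_list` argument to `parse`, for example `force_list=("many",)`, which asks the library to always produce a list for the named tags. Either approach keeps the output shape stable regardless of how many elements the XML contains.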

Indeed, it is quite intriguing to observe how four consecutive mistakes from two different teams ultimately led to an incorrect transmission between legacy services that had been functioning on the production layer for a significant period.

These mistakes serve as a valuable reminder of the importance of thorough testing, proper review processes, and anticipating potential issues when making changes to a codebase. In this case, the lack of sufficient contract tests, inadequate review, and oversight of external tests contributed to the propagation of errors throughout the system.

What we learned

We learned some pretty basic stuff, but it bears repeating:

1. Conduct thorough and gradual code reviews: Deep and comprehensive code reviews help us better understand the codebase and minimize the chances of overlooking potential problems.

2. Write tests, even for legacy projects: Regardless of whether you’re working on a legacy project or not, writing tests is crucial.

3. Verify external tests: If your team relies on external tests, make sure to review and check them before making any changes. They can serve as an additional barrier to catch potential mistakes.

4. Familiarize yourself with the documentation: Before diving into coding, take the time to thoroughly read the relevant documentation.

5. Don't change a contract without prior agreement!

Thank you for reading until the end! Feel free to ask anything you’d like to know or discuss further!
