Working in any area of the software development process, from coding to testing, we're going to make some mistakes from time to time. Usually, these mistakes are small and easily fixable - a failed test because of a recent change, buggy code that causes a runtime error, and so on. No harm, no foul.

But sometimes, we can make more significant mistakes. I'm not talking about issues that are annoying but otherwise harmless. I'm talking about huge mistakes with real consequences - the kind that cause you or your company to lose lots of time and money or damage your reputation. When you're the cause of these mistakes, it's one of the worst feelings in the world. All sorts of thoughts cross your mind - I'm going to lose my job over this, and no one will ever want to hire me again for the rest of my life.

However, no matter how bad a situation might seem at the moment, these problems aren't nearly as disastrous as you might think. How do I know this? I've had plenty of first-hand experience making terrible mistakes throughout my career. Even with some atrocious missteps, I've survived relatively unscathed and recovered from them with some work.

I wanted to share three of the biggest mistakes from my career as a software developer and automation tester. These anecdotes are not intended merely for others to face-palm or laugh at my oversights (although it's perfectly acceptable if you do). As mentioned earlier in this article, everyone makes mistakes. The key is to learn from them, and that's what I hope these stories provide.

Running tests against a live production database

I've worked in tiny startups for most of my career, often as the sole developer or automation tester in the entire organization. That meant putting on a dozen hats, including development, testing, and system administration. As a one-person IT department, I often had full access to every production server the company needed to run their software.

Early in my career, one project I worked on had a bug that only occurred in our production environment. The issue didn't surface on my development computer, nor on the staging server, which I thought ran the same as production. Of course, there was some slight misconfiguration on staging, causing the bug to slip through the cracks.

Once I determined the problem, I patched the bug and wrote an automated regression test case covering the scenario to avoid having the issue pop up again. After confirming I fixed the bug in production, I decided it was a good idea to run our automated end-to-end test suite against production, using an admin account to run through any authenticated scenarios.

About halfway through the test run, I turned pale as a ghost when I remembered that the test suite cleared some database tables as part of its setup and cleanup process. I immediately halted the test execution, but it was too late - the tests had wiped out a good portion of live customer data.

How I recovered

One word: backups. Thankfully, the application wasn't too complicated and had an automated backup system in place that took snapshots of the database every hour. I put the site in maintenance mode and restored the last backup, luckily completed less than 30 minutes prior.

I was also fortunate that the incident occurred early in the morning, when very few customers were on the site. The site's logs and analytics showed just a handful of people using the application when the accidental data deletion happened. No one's data was gone forever, thanks to those backups.

Lessons learned

  • Always have a sound backup system in place for any service with real-world users, no matter how small. Also, make sure those backups work and that you can restore them quickly in an emergency.
  • Never, ever run automated tests against production environments. Even if you have problems that only occur on those systems, it's best to check them manually so you know exactly what you're doing. It's not wise to let automation take over, especially when data cleanup is involved (see the guard sketch after this list).
  • Limit admin access to your production systems. In a sense, having access to the entire system helped me restore the database quickly after my mistake without involving anyone else. But if I were part of a larger team, I probably wouldn't have been able to get into this problem in the first place.
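As a concrete safeguard, here's a minimal sketch of the kind of guard a test suite can carry. It assumes a pytest-based suite and hypothetical APP_ENV and DATABASE_URL variables - the exact names and checks will differ in your project:

```python
# conftest.py - a hedged sketch, not my actual project code.
import os

import pytest

BLOCKED_ENVIRONMENTS = {"prod", "production"}
BLOCKED_URL_FRAGMENTS = ("prod", "live")


@pytest.fixture(scope="session", autouse=True)
def refuse_to_run_against_production():
    """Abort the entire test session if the suite appears to target production."""
    env = os.environ.get("APP_ENV", "").lower()          # hypothetical variable name
    db_url = os.environ.get("DATABASE_URL", "").lower()  # hypothetical variable name

    if env in BLOCKED_ENVIRONMENTS or any(f in db_url for f in BLOCKED_URL_FRAGMENTS):
        pytest.exit(
            "Refusing to run: this suite clears database tables during setup, "
            "and the configured environment looks like production.",
            returncode=1,
        )
```

A guard like this costs a few lines and turns a catastrophic mistake into nothing more than an aborted test run.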

Triggering an accidental "Denial-of-Service" attack on my own site

Almost all of the projects I work on nowadays have some third-party integration extending the application's capabilities. Most of these services are part of the site's core functionality. Without these services in place, the application would not work as intended. How much a non-functional integration affects the app ranges from degraded functionality to nothing working at all.

One particular project I worked on had a few third-party services that were critical to the app's existence. If any one of these integrations stopped functioning, the entire site ceased to work. It was essential for us to ensure that these integrations worked with our code, so we had lots of automated tests that ran against the live service.

One day, I had a very intense debugging session over a problem with one of these integrations. We already had automated test cases covering the buggy scenario, so while attempting to fix the problem, I continuously ran those tests, which pinged the service. In a short period, I kept making changes to the codebase and executing the tests to see if the fix worked. This workflow accelerated the debugging process. However, the integration wasn't happy about it.

That particular service didn't have a sandbox environment that we could use during testing, which meant we had to use the same access keys during development as in production. The terms of service for the integration specified a rate limit. If we made more than a certain number of requests within a given time period, the company could throttle or shut down our access to prevent their systems from getting overloaded. Rate limiting is a common practice to keep abuse and runaway processes from overwhelming a service.
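To give a rough idea of what respecting such a limit looks like in code, here is a minimal client-side throttle sketch. The 60-requests-per-minute figure and the idea of wrapping every outgoing call are my own illustration, not the real service's documented quota:

```python
# A hedged sketch of a client-side throttle; the numbers are illustrative only.
import time
from collections import deque

MAX_REQUESTS = 60
WINDOW_SECONDS = 60.0
_request_times: deque = deque()


def throttled(make_request):
    """Call make_request(), blocking first if needed to stay under the limit."""
    now = time.monotonic()

    # Forget requests that have fallen outside the rolling window.
    while _request_times and now - _request_times[0] > WINDOW_SECONDS:
        _request_times.popleft()

    # If the window is full, wait until the oldest request ages out.
    if len(_request_times) >= MAX_REQUESTS:
        time.sleep(WINDOW_SECONDS - (now - _request_times[0]))
        _request_times.popleft()

    _request_times.append(time.monotonic())
    return make_request()
```

Even a crude guard like this would have kept my change-run-repeat debugging loop from hammering the provider.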

Unsurprisingly, within a couple of minutes of running the automated tests over and over, the service decided that our access needed to get cut off. As mentioned, the service didn't have a testing environment, so it also shut off access to the live production environment, triggering a cascading effect of errors that brought the site down.

How I recovered

In the moment, there wasn't anything I could do besides frantically reach out to the third-party service's customer support. It took a couple of hours to restore our account's access - way too long for my comfort.

Immediately after this incident, I got together with the testing team to explain what happened. Together with the development team, we began putting safeguards in place to ensure this issue never happened again. Testers began finding areas of the automated test suite that could get by with mocks instead of hitting the live service, and developers started making plans for keeping the site available when one of these integrations failed for any reason.

Lessons learned

  • Limit the usage of live third-party integrations, especially when they don't have a sandbox environment for testing purposes. You and your team need to decide when live testing or mocks make more sense in your project (see the sketch after this list).
  • If you need to perform live testing during automation, use a separate account for any integrations. Many third-party integration services provide sandboxes that won't affect your live setup. If the service doesn't provide a testing environment, create a secondary account if possible.
  • Any application relying on third-party integrations needs a fallback plan in case those services fail. No third-party integration will have 100% uptime, so the team must ensure that the site can remain partially functional or show a message to users that the site is unavailable. Limited usage or a maintenance message is better than a broken experience for your customers.
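As an example of swapping a live call for a mock, here's a minimal sketch. The wrapper function, URL, and response shape are hypothetical stand-ins, not the integration I actually worked with:

```python
# test_rates.py - a hedged sketch; api.example.com and the "rate" field are
# placeholders, not a real endpoint or response format.
from unittest.mock import Mock, patch

import requests


def get_exchange_rate(base: str, target: str) -> float:
    """Thin wrapper around a hypothetical rate-limited third-party API."""
    response = requests.get(
        "https://api.example.com/rates",
        params={"base": base, "target": target},
        timeout=5,
    )
    response.raise_for_status()
    return response.json()["rate"]


def convert_to_eur(amount_usd: float) -> float:
    """Application code under test that depends on the third-party service."""
    return round(amount_usd * get_exchange_rate("USD", "EUR"), 2)


def test_convert_to_eur_without_touching_the_live_service():
    fake_response = Mock()
    fake_response.json.return_value = {"rate": 0.9}
    fake_response.raise_for_status.return_value = None

    # Patching requests.get means repeated test runs never reach the real API
    # and never count against its rate limit.
    with patch("requests.get", return_value=fake_response) as fake_get:
        assert convert_to_eur(100) == 90.0
        fake_get.assert_called_once()
```

This mirrors the direction my team took afterward: mocks for most scenarios, with live calls reserved for a small, deliberate set of checks.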

Misleading testers and developers for weeks thanks to one of my tests

I once joined a project that was just beginning to implement automated testing from scratch. The application was very complex, with lots of complicated business logic covering dozens of scenarios. It had zero test automation, and it took the existing QA team over a week to run through their regression test cases manually, so they were looking to automate some of the most tedious parts of their work.

Given the complexity of the project and its business rules in an industry I didn't understand very well (finance), I didn't grasp many concepts immediately. The existing team had been working on the project for over a year at that point, most of them from the very beginning, so they had a firm grasp of the application's ins and outs.

Unfortunately, I didn't have the opportunity to spend much time with the existing team. They were all slammed with work - mainly because of their testing woes - so I was left working on setting up automated testing for the project. I felt confident enough to get the ball rolling at first.

I set up the project's testing environment and got to work writing the first few test cases to give the rest of the team some guidance for picking up where I would leave off. The initial test cases worked well, and I had time on the project to continue pitching in.

There was one particular flow that the team wanted covered in the automation setup. It was a tedious, complex scenario that took a lot of their time. After days of wrapping my head around what was needed, I managed to automate the entire process from end to end. I felt proud of my accomplishment and continued working on other tasks.

A few weeks later, the project's tech lead pinged me and asked if I wrote that particular test. Proudly, I said I did, thinking he would heap some praise on my work. Instead, he proceeded to tell me that I missed a lot of different rules and scenarios in that flow, and the test was not particularly useful.

Worse yet, some testers and developers had been following the test scenario for their tasks, thinking that it reflected how that segment of the application should work. They had done some work assuming that my test covered everything related to that section, which led them to miss vital details and redo some of their work on those segments of the code. I unintentionally misled them into trusting my test as the source of truth when I didn't have the entire picture.

How I recovered

After spending a good chunk of time with the tech lead learning what my test had missed, I immediately redid the automated tests with my newfound understanding. I also asked the tech lead to review any new code before I committed it, to confirm it covered what was necessary.

I also made sure to get together with the developers and testers affected by my misunderstandings to apologize and explain what went wrong. They understood the situation, and they were able to proceed with their work without further incident.

Lessons learned

  • Always communicate with the entire team, and do lots of it. Despite how busy the rest of the developers and testers were, I should have worked with them more closely to gain more knowledge about the system I was testing. It would have avoided a ton of confusion and additional work.
  • Double-check my assumptions and never jump into work without fully understanding the project. I was so eager to begin automating tests that I never questioned whether my limited insight into the project was the entire picture. When working on a new project, asking the right questions before doing anything makes a huge difference.
  • Trust what's available, but verify. Tests are an excellent source of documentation, and the team should use them to understand how things work on a project. However, a test is only useful if it covers the things it should. Use tests as a guide, but make sure they're guiding you down the right path.

Summary: There's hidden magic in every mistake

These three stories are some of the more stressful situations I've caused as a developer and tester, but they're just the tip of the iceberg. I've made plenty of mistakes throughout my career, and I'm sure I'll make plenty more before I'm finished. No matter how many mistakes I've made, one thing's for sure - they helped me grow as a tester.

Sure, at the time these situations happened, I wanted to disappear from the face of the Earth in shame. But in retrospect, I'm happy these problems happened when they did because they made me the tester I am today. Without the "hidden magic" behind these situations - the lessons I learned from them - I'm sure I wouldn't have the skills I rely on in my work today.

I'm also grateful that I had managers and team members who showed incredible compassion when I screwed up in my work. They knew I wasn't creating these problems intentionally, so they provided the support I needed to correct the issue and move forward. If you're a manager or part of a team, be forgiving and empathetic when others make mistakes. You don't know when you'll be on the other side of that equation.

Finally, if you're beating yourself up over mistakes in your work, don't. The times when you mess something up, as rough as they feel, provide some of the best moments to level up in your career. Don't shy away from mistakes - you will make some eventually. What separates the good testers from the best is the lessons they've learned along the way and how those lessons shaped them for the better.

How have you or your team recovered from a rather severe blunder during work? Share your story and lessons learned in the comments section below. Don't be embarrassed - we've all gone through something!