PagerDuty Blog

How to Learn From Failure in DevOps

DevOps failure is a touchy subject for some, because DevOps is typically perceived as a way to avoid failure. As a result, when something fails in a DevOps practice, the situation can seem almost hopeless. However, as a fail-fast business approach and the “fail and adjust sooner” methodology of Agile often show, DevOps failures are actually a step in the right direction. They’re the first step toward learning from failure and turning your DevOps practice into one that leads you to greater success, sooner rather than later.

DevOps has its roots in Agile, where shorter development cycles with frequent feedback loops guide you, iteration by iteration, toward product delivery that is more closely aligned with customer needs. The point of the feedback loop is to learn from your actions through customer feedback, then measure what went right and what can be improved.

The feedback loops are effective because frequent change and failure provide opportunities to learn in a way that actually reduces risk. There are many types of failures that can strike a DevOps practice, and your reaction matters. Let’s examine some now.

Incident Reaction Failure

When software issues occur — whether they’re deployment-related or bugs — your reaction to them is often more important than the fact that they occurred in the first place. Failure here can occur in multiple forms, including:

  • Overreaction: Spending too much time or too many resources fixing a failure or trying to avoid a repeat
  • Incorrect reaction: Misdiagnosing a problem or assuming the wrong problem, potentially due to lack of information
  • Lack of reaction: Not fixing the problem soon enough or effectively enough, so the issue arises again soon after

Incident management, and measuring its success through KPIs, is a critical ingredient in DevOps success. Assess and resolve incidents quickly by surfacing relevant information, recruiting other team members to help, and looking at the system as a whole instead of pointing at individual components (and the people behind them) as the cause.

The takeaways here include a holistic approach to incident response, spreading knowledge, learning from failure and past incidents to prevent future issues, and automating responses to solve issues more quickly.
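To make the KPI idea concrete, here is a minimal Python sketch of two commonly used incident metrics: mean time to acknowledge and mean time to resolve. The `Incident` record and its field names are assumptions made for illustration; a real incident tool will expose its own data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    opened: datetime        # when the alert fired
    acknowledged: datetime  # when a responder picked it up
    resolved: datetime      # when service was restored


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average delay between an alert firing and a responder acknowledging it."""
    seconds = mean((i.acknowledged - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average delay between an alert firing and the incident being resolved."""
    seconds = mean((i.resolved - i.opened).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)
```

Tracking these averages over time, rather than judging any single incident, keeps the focus on how the system as a whole responds instead of on who was on call.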

Creating Too Much Process

Some think DevOps is simply a process or a tool and go about setting up rigid, formal sets of procedures to follow. But imposing too much formality, being too strict about process, and mandating certain toolsets can limit your organization’s flexibility when it comes to change. Instead, DevOps should include tools and procedures that enhance your organization’s nimbleness.

For example, tools that measure your overall software development and deployment efficiency will help provide feedback to improve that efficiency. How? By pointing out where changes need to be made and which bottlenecks need to be removed. With too much rigidity, your organization may not be able to change quickly enough to improve or meet the needs of a changing user base or market.
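As a rough illustration of how measurement can point at a bottleneck, the sketch below takes per-stage durations from several delivery runs and reports the stage with the highest median. The stage names and the input shape (one dictionary of minutes per run) are assumptions for this example, not the format of any particular tool.

```python
from collections import defaultdict
from statistics import median


def slowest_stage(runs: list[dict[str, float]]) -> tuple[str, float]:
    """Return the stage with the highest median duration, in minutes."""
    durations: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        for stage, minutes in run.items():
            durations[stage].append(minutes)
    # The median is less sensitive than the mean to one unusually slow run.
    medians = {stage: median(values) for stage, values in durations.items()}
    stage = max(medians, key=medians.get)
    return stage, medians[stage]


# Hypothetical data: minutes spent in each stage for three changes.
runs = [
    {"build": 4, "test": 22, "review": 180, "deploy": 6},
    {"build": 5, "test": 25, "review": 240, "deploy": 7},
    {"build": 4, "test": 20, "review": 60, "deploy": 5},
]
print(slowest_stage(runs))
```

With this sample data, the review stage dominates, which suggests the next improvement is a conversation about how reviews are scheduled rather than a new build tool.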

Limiting the Scope of DevOps

The work of a DevOps practice doesn’t fall on a single person or even a team within your organization. I’ve witnessed situations where individuals were specifically hired as “The DevOps Person” who would magically do “DevOps stuff” to fix all of the existing software deployment and maintenance problems. But here’s the short of it: That approach will fail.

Similarly, I’ve seen situations where customer service and support personnel received calls from customers about a new feature that was installed over the weekend, a feature the support staff had no prior knowledge of. When key support people learn about changes to your software from your customers, you have a DevOps failure.

DevOps is an organizational practice that takes what was learned from Agile development and applies it end-to-end throughout software delivery. It’s a practice that should be extended into other functions across the business. This means development work is aligned with customer value, not projects, and product teams work not only with IT staff, but also with the folks who answer the phones, write the technical documentation, advertise and market the application, and serve as business sponsors, including the executives who plan for the future. There’s something for everyone to learn when feedback loops are expanded to include everyone in the organization.

The takeaways here include expanding feedback loops, communication, and key measurement activities to all parts of your organization. Additionally, don’t ignore third-party vendors, suppliers, or components. Remember to include validation, auditing, and monitoring for external components as well.

The Blame Game and Competition

Given that Agile often uncovers bottlenecks in an organization’s software delivery pipeline, it’s easy to point the finger at people or activities that are perceived to slow things down. When this occurs, DevOps can drive an even greater wedge between teams, which is the polar opposite of what was intended.

Instead, remove the silos (teams or people who tend to work in a vacuum) and tear down the walls between teams before identifying bottlenecks and improvements. With everyone working together first, under a shared set of responsibilities, improvements come from a unified team rather than from competition between teams.

Keeping a Silo or Two

It’s not unusual for organizations — even those that have seen actual success with Agile and DevOps — to create exceptions within the company when it comes to the practice. Perhaps it’s a legacy application, a proven team, or even a veteran employee. However, excluding a single person or team from the DevOps practice can be troublesome.

Silos have a way of multiplying and becoming toxic to an organization’s software delivery practice. Even if the silo includes a company co-founder, no one should be exempt from the processes and collaboration that your organization’s DevOps practice encourages. In a nutshell: DevOps means removing silos and bottlenecks. No exceptions.

Ignoring the Development Environment

DevOps applies to more than the production environment and deployments. Even when Agile development sprints are successful, production deployments are automated, and testing is continuous, failing to extend those practices into the development environment creates its own problems. For example, if development and test environments aren’t managed using the same tools, approaches, and people that manage your production environment, you risk production problems the first time your software encounters potentially unique configurations, which often leads to unexpected failure.

Know how the development work is done, where it’s being done, and where it’s going to end up in production. If any one of those three is an unknown at development time, the risk of failure will grow to a near certainty at release time. It’s absolutely crucial to orchestrate and control your development workflows and environments so that they match, as closely as possible, how the software will run in production.
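One lightweight way to catch environment differences early is to compare configuration between environments and list what has drifted. The following Python sketch assumes each environment's settings can be exported as a flat dictionary of strings; the keys and values shown are hypothetical.

```python
from typing import Optional


def config_drift(
    dev: dict[str, str], prod: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Map each differing key to its (dev, prod) values.

    None marks a key that is missing from that environment.
    """
    drift = {}
    for key in sorted(dev.keys() | prod.keys()):
        dev_value, prod_value = dev.get(key), prod.get(key)
        if dev_value != prod_value:
            drift[key] = (dev_value, prod_value)
    return drift


# Hypothetical settings exported from each environment.
dev = {"python": "3.11", "db": "sqlite", "tls": "off"}
prod = {"python": "3.11", "db": "postgres", "tls": "on", "cache": "redis"}

for key, (dev_value, prod_value) in config_drift(dev, prod).items():
    print(f"{key}: dev={dev_value!r} prod={prod_value!r}")
```

Running a check like this as part of the build makes each difference a deliberate, visible decision instead of a surprise at release time.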

Too Much Team Access

The increased teamwork that results from DevOps is a good thing, and so is the growing capability team members achieve as they’re exposed to newer procedures or components in your software. However, forgetting the added responsibility that comes along with this can lead to critical failure. For example, while it’s generally good that DevOps might help a UI developer become familiar with an application’s database — and empower that UI developer to even roll out DB changes self-service — it’s not good when changes are made to a production database accidentally. While bottlenecks are removed, controls also need to be in place to avoid catastrophe.

The takeaways here are to monitor and report unexpected access to critical production systems, and to put procedures in place to roll back (or even roll forward) damaging changes, whether intended or not.
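The monitoring half of that advice can be sketched as a simple allowlist check over access events. This is an illustration under stated assumptions: the `AccessEvent` shape, the user names, and the system names are all hypothetical, and a real setup would read events from an audit log rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical audit record; field names are illustrative only."""
    user: str
    system: str
    action: str  # for example "read" or "write"


def unexpected_access(
    events: list[AccessEvent], allowed: set[tuple[str, str, str]]
) -> list[AccessEvent]:
    """Return events whose (user, system, action) is not on the allowlist."""
    return [e for e in events if (e.user, e.system, e.action) not in allowed]


# Hypothetical allowlist: who may do what on which system.
allowed = {
    ("dba_team", "prod-db", "write"),
    ("ui_dev", "prod-db", "read"),
    ("ui_dev", "staging-db", "write"),
}

events = [
    AccessEvent("ui_dev", "staging-db", "write"),
    AccessEvent("ui_dev", "prod-db", "write"),  # not on the allowlist
]

for event in unexpected_access(events, allowed):
    print(f"ALERT: {event.user} performed {event.action} on {event.system}")
```

The point is not to block self-service changes, but to make sure an accidental write to production is noticed quickly enough to roll it back.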

Conclusion: It’s About People

The goal isn’t successful DevOps in and of itself — it’s neither a process nor an activity. Instead, the goal is to use the tenets of DevOps to improve your software, customer experience, and organization. In fact, if any one of these is missing, life might prove difficult, even if you think you’re succeeding at DevOps.

It’s important to remember that all of these ingredients involve actual people. Regardless of how DevOps helps you get there, learning from failure, instilling a culture of constant improvement, making customers happy, and having fun while doing it are what’s truly important. Every day as you work to accomplish your goals, keep in mind what really matters.