At Pixelmatters, we understand the critical role of delivering high-quality products that meet the needs and expectations of our clients. We place a strong emphasis on quality-driven development processes that enable us to create functional, reliable, scalable, and sustainable software over the long term, and our teams uphold high standards throughout the development process, from design to testing and deployment. Quality is our priority.
This mindset has led Pixelmatters to implement processes that prioritize identifying and preventing issues and bugs early in the Software Development Lifecycle. This approach not only improves product quality and long-term sustainability but also minimizes vulnerabilities and security risks. All of this results in faster deployments and a shorter time to market.
However, despite the effort of the whole team to create robust software systems, incidents do happen due to the software's inherent complexity, and users may encounter unexpected behaviors. Having a reliable approach to handle these Incidents is crucial, and that's where the Incident Management process comes into play.
In this blog post, we'll discuss the significance of Incident Management with a particular emphasis on Postmortem documentation and provide some useful tips on how to develop this kind of documentation effectively.
When reviewing an Incident, there are several objectives and factors to take into account, but we’ll talk about them later. For now, let's begin with the fundamentals.
What is an Incident?
An Incident results from a deviation between actual and expected results and disrupts the normal operation of a software system.
Incidents typically impact the software system's availability, reliability, performance, or security, and they range in severity from a complete web service outage to just a few users encountering errors. It is important to mention that Incidents include not only bugs but also other problems such as hardware failures, network outages, human errors, security breaches, or even natural disasters.
It is of the utmost importance that an investigation is carried out to identify the root cause and take the necessary actions to prevent such events from recurring. To monitor the process, roles are assigned to ensure that the Incident is tracked from its identification until it is considered solved.
Naturally, an Incident is only considered solved when the application works as expected and all the functionalities are restored.
Now, consider that the Incident is solved and gone. The storm has passed. Everything is calm. But our work is not over yet. It is time to move forward in our process and start developing the next piece of documentation: the Postmortem.
What is an Incident Postmortem?
A Postmortem is a written report developed after an Incident is resolved. By thoroughly documenting the Incident and its resolution, identifying all the root causes, and defining actions to take, the Postmortem report can serve as a valuable resource to prevent similar incidents from occurring in the future.
At Pixelmatters, when an Incident occurs, the process begins by starting the documentation and including all parties involved in the resolution. These parties include the Engineers who worked on the affected code, Product Owners, the QA Engineers who conducted testing, and other relevant roles.
It is important to involve anyone with relevant information or insights about the Incident. By involving multiple perspectives and expertise, the Postmortem report can offer a more in-depth analysis of the incident and be more accurate.
This process is crucial to enable the team to thoroughly discuss and reflect on the various topics we will cover later in this article.
To create an environment where everyone feels comfortable sharing potential flaws without feeling rushed or pressured, and to avoid finger-pointing and the "Blame Game", we have decided to conduct Postmortems asynchronously.
Before we discuss the elements of a thorough Postmortem report, some best practices regarding when to begin this process are worth noting.
Postmortems should be conducted only when there is a significant Incident, such as failed deployments, security breaches, data losses, missed deadlines, repeated incidents, or major outages with user impact.
Not every issue needs to trigger an Incident response or a Postmortem analysis. To avoid wasted resources and counterproductive outcomes, refrain from initiating this process for minor issues, scheduled work, or proactive maintenance. Examples include routine checks, updates, backups, and other tasks designed to prevent larger issues from occurring.
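To make these criteria easier to apply consistently, they can be written down as a small decision helper. The following is only a sketch: the `Event` type, the category labels, and the order of the rules are assumptions made for illustration, loosely based on the examples above, and every team will want to tune them.

```python
from dataclasses import dataclass

# Hypothetical labels, loosely based on the examples listed above.
SIGNIFICANT_KINDS = {
    "failed_deployment",
    "security_breach",
    "data_loss",
    "missed_deadline",
}
ROUTINE_KINDS = {"minor_issue", "scheduled_work", "proactive_maintenance"}


@dataclass
class Event:
    kind: str
    user_impact: bool = False  # did users notice the problem?
    repeated: bool = False     # has a similar incident happened before?


def needs_postmortem(event: Event) -> bool:
    """Return True when the event looks significant enough for a Postmortem."""
    if event.kind in ROUTINE_KINDS:
        return False  # minor issues and planned work do not qualify
    if event.repeated:
        return True   # repeated incidents always deserve a closer look
    if event.kind == "outage":
        return event.user_impact  # only outages that users actually felt
    return event.kind in SIGNIFICANT_KINDS
```

For instance, `needs_postmortem(Event("outage", user_impact=True))` returns `True`, while `needs_postmortem(Event("scheduled_work"))` returns `False`.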
How is a Postmortem document structured?
To effectively structure a postmortem document, it's crucial to address specific topics and ensure that the Incident is retrospectively analyzed in a comprehensive manner.
Here are the 13 topics we usually cover at Pixelmatters to create insightful and actionable Postmortem documents. They might be helpful to you too 👇
1. Incident Overview
In the first section, provide a summary of the incident, including a description of what happened, the reason for the incident, its severity, and the duration of its impact.
2. Contributing Factors
Use the "Five Whys" technique to determine the root cause. Start by describing the impact of the issue and ask why it occurred. Then take the answer and ask why that happened, and why it resulted in the impact that it did. Continue asking "why" until you arrive at the root cause.
In the process, if there were any actions taken that exacerbated the issue, also include them here to learn from any mistakes made during the resolution process.
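A "Five Whys" chain is easier to review when each question is written next to its answer. Here is a minimal sketch of how such a chain could be recorded and rendered as text. The `Why` type and `format_five_whys` function are hypothetical names, and treating the last answer as the root cause candidate is a simplification of the technique.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Why:
    question: str
    answer: str


def format_five_whys(impact: str, chain: Sequence[Why]) -> str:
    """Render the chain of whys; the last answer is the root cause candidate."""
    if not chain:
        raise ValueError("Ask 'why' at least once")
    lines = [f"Impact: {impact}"]
    for number, step in enumerate(chain, start=1):
        lines.append(f"Why #{number}: {step.question}")
        lines.append(f"  Because: {step.answer}")
    lines.append(f"Root cause candidate: {chain[-1].answer}")
    return "\n".join(lines)


# Invented example data, for illustration only.
example = format_five_whys(
    "Some users could not log in",
    [
        Why("Why did logins fail?", "The session service rejected requests"),
        Why("Why were requests rejected?", "An expired certificate was in use"),
        Why("Why was it expired?", "Renewal was a manual, untracked task"),
    ],
)
print(example)
```

The chain does not have to be exactly five steps long. Stop when the answer is something the team can actually change.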
3. Root Cause
Identify the underlying cause of the incident. This specific issue needs to be addressed to prevent similar incidents from occurring in the future.
4. Fault
Describe how the implemented changes did not yield the expected results. Attach any relevant visual data or screenshots to help illustrate the fault.
5. Impact
Provide a detailed description of how the incident affected internal and external users, including the number of support cases raised. It is important to be specific and provide exact numbers.
6. Detection
In this section, describe how the team was able to identify the incident and at what point in time. Additionally, consider how the time-to-detection could be improved.
7. Response
In the Response section, answer a few more questions: who responded to the incident, and what actions were taken to resolve the issue? Identifying any obstacles or delays encountered during the response phase is also important.
8. Recovery
Describe what solved the incident. Include any temporary fixes implemented and the long-term solution, if applicable.
9. Timeline
The purpose of the Timeline section is to describe any significant events that occurred during the incident. Begin by specifying the timezone, and then list any lead-up events, starts of activity, the first known impact, and escalations. It is also crucial to record any decisions and changes made during the incident, the moment it was resolved, and any post-incident events, if applicable.
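A timeline with an explicit timezone also makes it simple to derive figures such as time-to-detection. The sketch below assumes events are stored as timezone-aware timestamps. The `TimelineEvent` type, the `kind` labels, and the sample data are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class TimelineEvent:
    at: datetime  # must be timezone-aware
    kind: str     # hypothetical labels: "first_impact", "detected", "resolved", ...
    note: str


def sorted_timeline(events: List[TimelineEvent]) -> List[TimelineEvent]:
    """Return events in chronological order, rejecting naive timestamps."""
    for event in events:
        if event.at.tzinfo is None:
            raise ValueError(f"Timestamp without timezone: {event.note}")
    return sorted(events, key=lambda event: event.at)


def elapsed(events: List[TimelineEvent], start_kind: str, end_kind: str) -> timedelta:
    """Time between the first event of each kind, e.g. impact -> detection."""
    first_seen = {}
    for event in sorted_timeline(events):
        first_seen.setdefault(event.kind, event.at)
    return first_seen[end_kind] - first_seen[start_kind]


# Invented example data, recorded in UTC.
UTC = timezone.utc
events = [
    TimelineEvent(datetime(2024, 1, 15, 10, 12, tzinfo=UTC), "detected", "Alert fired"),
    TimelineEvent(datetime(2024, 1, 15, 10, 0, tzinfo=UTC), "first_impact", "First failed request"),
    TimelineEvent(datetime(2024, 1, 15, 11, 5, tzinfo=UTC), "resolved", "Fix deployed"),
]
print("Time to detection:", elapsed(events, "first_impact", "detected"))
print("Time to resolution:", elapsed(events, "first_impact", "resolved"))
```

Rejecting timestamps without a timezone is a deliberate choice here: mixing local times from different team members is an easy way to end up with a misleading timeline.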
10. Backlog Check
Did any of the work in the Engineering backlog have the potential to prevent or at least mitigate the impact of this incident? Note it.
11. Recurrence
We have identified the root cause of the incident. Let's take a moment to review previous incidents and determine if the same root cause caused any of them. If so, note them. Then, investigate the actions taken at the time and why they were not effective in preventing this incident from occurring.
12. Lessons Learned
Now that the incident is resolved, the team can answer the following questions:
- What went well?
- What could have gone better?
13. Action Items
Note the necessary corrective actions to prevent similar incidents from occurring in the future. Identifying the person responsible for completing the work and the system used to track progress is crucial.
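One lightweight way to keep action items honest is to check that each one has an owner and a reference in the tracking system before the Postmortem is closed. This is a sketch under assumptions: `ActionItem`, its fields, and `items_missing_details` are hypothetical names, and `tracker_ref` stands for whatever ticket ID or link your team's tracker provides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str = ""        # person responsible for completing the work
    tracker_ref: str = ""  # hypothetical: ticket ID or link in the team's tracker
    done: bool = False


def items_missing_details(items: List[ActionItem]) -> List[ActionItem]:
    """Return action items that still lack an owner or a tracking reference."""
    return [
        item for item in items
        if not item.owner.strip() or not item.tracker_ref.strip()
    ]


# Invented example data, for illustration only.
items = [
    ActionItem("Add alert for certificate expiry", owner="Ana", tracker_ref="OPS-1"),
    ActionItem("Document the rollback procedure", owner="Rui"),
]
for item in items_missing_details(items):
    print("Needs owner or tracking reference:", item.description)
```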
Final thoughts 💭
I hope this blog post has achieved its goal and highlighted the importance of having a well-defined process to handle incidents and, more importantly, the ability to learn from them and improve 🚀
The benefits are countless. Your project will have improved quality, your team will have future references to use and learn from past mistakes, and your clients will have greater transparency regarding what the Incident was, how and why it occurred, and how it was handled.
By prioritizing incident management and postmortem analysis, teams can foster a culture of continuous learning and improvement, ultimately resulting in better products and services for everyone involved.