Outages and postmortems are a fact of life for any software engineer responsible for managing a complex system. And it can be safely said that those two words – “outage” and “postmortem,” do not carry any positive connotations in the remotest sense of the word.
In fact, they are generally dreaded by most engineers. While that sentiment is understandable given the direct impact of such incidents on customers and the accompanying disruption, our individual perspective matters a lot here as well. If we are able to look beyond the damage caused by such incidents, we might just realize that outages and postmortems shouldn’t be “dreaded,” but instead, wholeheartedly embraced. One has to only try, and the negative vibes associated with these incidents may quickly give way to an appreciation of the complexity in modern big data systems.
The Accidental Harmony of Layered Failures
As cliche as it may sound, “beauty” indeed lies in the eyes of the beholder. And one of the most beautiful things about an outage/postmortem is the spectacular way in which modern big data applications often blow up.
When they fail, there are often dozens of things that fail simultaneously, all of which collude, resulting in an outage. This accidental harmony among failures and the dissonance among the guards and defenses put in place by engineers, is a constant feature of such incidents and is always something to marvel at. It’s almost as if the resonance frequencies of various failure conditions match, thereby amplifying the overall impact.
What’s even more surprising is the way in which failures at multiple layers can collude. For example, it might so happen that an outage-inducing bug is missed by unit tests due to missing test cases, or even worse, a bug in the tests! Integration tests in staging environments may have again failed to catch the bug, either due to a missing test case or disparity in the workload/configuration of staging/production environments. There could also be misses in monitoring/alerting, resulting in increased MTTIs.
Similarly, there may be avoidable process gaps in the outage handling procedure itself. For example, some on-calls may have too high of an escalation timeout for pages or may have failed to update their phone numbers in the pager service when traveling abroad (yup, that happens too!). Sometimes, the tests are perfect, and they even catch the error in staging environments, but due to a lack of communication among teams, the buggy version accidentally gets upgraded to production.
Outages are Like Deterministic Chaos
In some sense, these outages can also be compared to “deterministic chaos” caused by an otherwise harmless trigger that manages to pierce through multiple levels of defenses. To top it off, there are always people involved at some level in managing such systems, so the possibility of a mundane human error is never too far away.
All in all, every single outage can be considered as a potential case study of cascading failures and their layered harmony.
An Intellectual Journey
Another very deeply satisfying aspect of an outage/postmortem is the intellectual journey from “how did that happen?” to “that happened exactly because X, Y, Z.” Even at the system level, it’s necessary to disentangle the various interactions and hidden dependencies, discover unstated assumptions and dig through multiple layers of “why’s” to make sense of it all.
When properly done, root cause analysis for outages of even moderately complex systems, demand a certain level of tenacity and perseverance, and the fruits of such labor can be a worthwhile pursuit in and of itself. There is a certain joy in putting the pieces of a puzzle together, and outages/postmortems present us exactly with that opportunity.
Besides the above intangibles, outages and their subsequent postmortems have other very tangible benefits. They not only help develop operational knowledge, but also provide a focused path (within the scope of the outage) to learn about the nitty-gritty details of the system. At the managerial level too, they can act as road signs for course correction and help get the priorities right.
Of course, none of the above is an excuse to have more outages and postmortems! We should always strive to build reliable, fault-tolerant systems to minimize such incidents, but when they do happen, we should take them in stride, and try to appreciate the complexity of the software systems all around us.
Love thy outages. Love thy postmortems.