Companies that move fast put pressure on developers and QA to continually innovate and push software out. This leaves the people with the pager, quite often the same developers, dealing with a continuous flow of production problems. On-call pain is the level of interrupts (pager notifications), plus the level of work that the on-call is expected to perform “keeping the system up” during their shift. How can we reduce this pain without slowing down development or having decrees like “there shall be no errors in our logs”? Assuming there is no time to do overhauls of monitoring systems, or make major architecture changes, here is a step-by-step approach to reducing on-call pain.
Measure On-Call Pain
As always, start out by measuring where you are now and setting a goal of where you want to be.
- Figure out how often your on-call gets paged or interrupted over a meaningful period of time, such as a week or a month. Track this number.
- If your on-call is responsible for non-interrupt driven tasks such as trouble tickets, automation, deployments or anything else, approximate how much time they spend on those activities.
- Make a realistic goal of how often you think it’s acceptable for the on-call to get interrupted and how much of their time they should spend on non-interrupt driven tasks. We all want to drive the interrupt-driven work to zero, but if your system breaks several times per week, it is not realistic for the on-call to be that quiet.
- Continuously track this pain metric. Although it may not impact your customers or your product, it impacts the sanity of your employees.
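A minimal sketch of the tracking step above, assuming you can export page timestamps from your paging tool. The function names, the sample data, and the goal of two pages per week are all hypothetical and only for illustration.

```python
from collections import Counter
from datetime import datetime


def pages_per_week(page_times):
    """Count pages per ISO week; keys are (iso_year, iso_week)."""
    counts = Counter()
    for ts in page_times:
        iso = ts.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def weeks_over_goal(counts, goal):
    """Return the weeks in which the on-call was paged more often than the goal."""
    return sorted(week for week, n in counts.items() if n > goal)


# Hypothetical export of page timestamps.
pages = [
    datetime(2024, 1, 1, 3, 15),
    datetime(2024, 1, 2, 14, 0),
    datetime(2024, 1, 3, 22, 40),
    datetime(2024, 1, 9, 9, 5),
]

counts = pages_per_week(pages)
print(counts)
print(weeks_over_goal(counts, goal=2))
```

The same idea extends to non-interrupt work: record approximate hours per week next to the page count and compare both against the goal.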
Reduce Alert Noise
The first step to reducing on-call pain is to systematically reduce alert noise. The easiest way to do this is to ask the on-call to keep track of the noise: alarms that fired but did not require them to fix anything.
- Remove alarms where no action is required.
- Adjust thresholds for alarms that were too sensitive.
- Put de-duplication logic in place. The same alarm on multiple hosts should log to the same trouble ticket and not keep paging the on-call.
- If you have monitoring software that does flapping detection, put that in place. Otherwise adjust thresholds in such a way to minimize flapping.
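The de-duplication and flapping ideas above can be sketched as follows. This is an illustration under assumptions, not the behavior of any particular monitoring product: the `AlarmFilter` class, its method names, and the window and threshold values are hypothetical.

```python
from collections import deque


class AlarmFilter:
    """Group the same alarm across hosts and detect flapping alarms."""

    def __init__(self, flap_window=5, flap_threshold=3):
        self.open_tickets = {}  # alarm name -> set of affected hosts
        self.history = {}       # (alarm, host) -> recent states
        self.flap_window = flap_window
        self.flap_threshold = flap_threshold

    def should_page(self, alarm, host):
        """Page only for the first host; later hosts join the same ticket."""
        hosts = self.open_tickets.setdefault(alarm, set())
        is_first = not hosts
        hosts.add(host)
        return is_first

    def is_flapping(self, alarm, host, state):
        """Record a state and report whether it changed too often recently."""
        hist = self.history.setdefault(
            (alarm, host), deque(maxlen=self.flap_window)
        )
        hist.append(state)
        states = list(hist)
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        return changes >= self.flap_threshold


f = AlarmFilter()
print(f.should_page("disk_full", "web1"))  # first host opens the ticket
print(f.should_page("disk_full", "web2"))  # second host joins it
for state in ["ok", "alarm", "ok", "alarm"]:
    flapping = f.is_flapping("cpu_high", "web1", state)
print(flapping)
```

A flapping alarm is usually better held back or sent to a queue until its threshold is adjusted, rather than paging on every state change.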
Stop Abusing Humans
Any time that you have playbooks or procedures for troubleshooting common problems, ask yourself if you are engaging in human abuse. Most playbooks consist of instructions which require very little actual human intelligence. So why use a human to do them?
- Go through your playbooks and write scripts for everything you can. Reduce the playbook procedure to “for problem x, run script x.”
- Automate running those scripts. You can start with writing crons that check for a condition and run the script and go all the way to a complex auto-remediation system.
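As a starting point for the cron approach above, here is a hedged sketch of a wrapper that runs a check, runs the fix script if the check fails, and re-checks before deciding whether a human is needed. The `remediate` function and the commands passed to it are hypothetical; the check and fix commands stand in for whatever scripts came out of your playbooks.

```python
import subprocess
import sys


def remediate(check_cmd, fix_cmd):
    """Run check_cmd; on failure run fix_cmd once and check again.

    Returns "healthy", "fixed", or "escalate". A non-zero exit code
    from check_cmd is treated as "the problem is present".
    """
    if subprocess.run(check_cmd).returncode == 0:
        return "healthy"
    subprocess.run(fix_cmd)
    if subprocess.run(check_cmd).returncode == 0:
        return "fixed"
    return "escalate"


# Stand-in commands so the sketch runs anywhere Python is installed.
always_ok = [sys.executable, "-c", "raise SystemExit(0)"]
always_bad = [sys.executable, "-c", "raise SystemExit(1)"]

print(remediate(always_ok, always_bad))
print(remediate(always_bad, always_ok))
```

Only the "escalate" result needs to reach the pager; "fixed" results are worth logging so that recurring problems can be counted and followed up later.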
Get The Metrics Right
Metrics have the ability to reduce on-call pain, if used correctly. If you know and trust your metrics, you can create an internal Service Level Agreement that is reliable. A breach of that SLA pages the on-call. If you have the right type of metrics and are able to display and navigate them in a meaningful way, then the on-call can quickly focus on the problem without getting inundated with tens of alarms from various systems.
- Create internal SLAs that alarm before their impact is felt by the customer.
- Ensure that the on-calls can drill down from the alarming SLA to the problem at hand.
- As with de-duplication, prevent related alarms from all paging at once, while still sending a notification for each failure. This relieves pager pain.
- The holy grail here is a system that shows alarm dependencies, which can also be achieved with a set of good dashboards.
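One way to picture alarm dependencies: if an alarm depends on another alarm that is also firing, it only notifies, and the page goes to the alarm closest to the root cause. The sketch below assumes you can write the dependencies down as a simple mapping; the `split_pages` function and the alarm names are hypothetical.

```python
def split_pages(firing, depends_on):
    """Split firing alarms into (page, notify_only).

    depends_on maps an alarm to the alarms it depends on. An alarm is
    notify-only if anything it depends on, directly or transitively,
    is also firing.
    """
    def has_firing_ancestor(alarm, seen):
        for parent in depends_on.get(alarm, ()):
            if parent in seen:
                continue  # guard against cycles in the mapping
            seen.add(parent)
            if parent in firing or has_firing_ancestor(parent, seen):
                return True
        return False

    page = sorted(a for a in firing if not has_firing_ancestor(a, set()))
    notify_only = sorted(a for a in firing if a not in page)
    return page, notify_only


# Hypothetical dependency map: checkout needs the API, the API needs the DB.
depends_on = {
    "checkout_latency": ["api_errors"],
    "api_errors": ["db_down"],
}
firing = {"checkout_latency", "api_errors", "db_down"}

page, notify_only = split_pages(firing, depends_on)
print(page)
print(notify_only)
```

Even without tooling that understands such a map, a dashboard laid out in dependency order gives the on-call the same drill-down path by eye.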
Decide On Severity
If an on-call is constantly working in an interrupt-driven mode, it’s hard for him or her to assess the situation. The urgency is always the same, no matter what is going on. Non-critical interrupts increase stress as well as time to resolution. This is where the subject of severity comes in. Define severities from highest to lowest. These might depend on the tools you have, but generally you want three severities:
- Define the highest severity. That is an outage or a major customer facing incident. In this case, the on-call gets paged and engages other stakeholders or an SLA breach pages all the stakeholders at the same time (immediate escalation). This one does not reduce any on-call pain, but it should exist.
- Define the second severity. This is a critical event: an internal SLA fires an alarm or a major system malfunction happens. It is best practice to define this as an alarm stating that customers are impacted or are going to be impacted within N hours if this does not get fixed.
- Define the third severity. The third severity is everything else. The on-call gets paged for the first two severities (they are interrupt-driven) but the third severity goes into a queue for the on-call to prioritize and work through when they have time. It is not interrupt-driven.
- Create a procedure for the non-interrupt driven work of the third severity.
- Move alarms that do not meet the bar for the second severity into the third severity (they should not page the on-call).
- Ensure that the third severity alarms still get done by the on-call and are handed off appropriately between shifts.
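The three severities can be sketched as a small routing function. The names (`Severity`, `classify`, `route`), the inputs, and the default of N = 4 hours are assumptions made for illustration; your tools and thresholds will differ.

```python
from collections import deque
from enum import IntEnum


class Severity(IntEnum):
    OUTAGE = 1    # outage or major customer-facing incident
    CRITICAL = 2  # customers impacted now or within N hours
    QUEUED = 3    # everything else


def classify(is_outage, hours_to_customer_impact, n_hours=4):
    """Pick a severity; hours_to_customer_impact may be None if unknown."""
    if is_outage:
        return Severity.OUTAGE
    if hours_to_customer_impact is not None and hours_to_customer_impact <= n_hours:
        return Severity.CRITICAL
    return Severity.QUEUED


def route(alarm, severity, work_queue):
    """Page for the first two severities; queue the third for later."""
    if severity == Severity.QUEUED:
        work_queue.append(alarm)
        return "queued"
    if severity == Severity.OUTAGE:
        return "page on-call and stakeholders"
    return "page on-call"


queue = deque()
print(route("site down", classify(True, 0), queue))
print(route("disk 90% full", classify(False, 2), queue))
print(route("cert expires in 30 days", classify(False, 720), queue))
print(list(queue))
```

Whatever is left in the queue at the end of a shift is what gets handed off to the next on-call.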
Make Your Software Resilient
I know that I began with statements like “assuming you have no time,” but now the on-calls have more time. The on-calls should spend that time following up on root causes and really making the changes that will have a lasting impact on stability of the software itself.
- Go through all the automation that you have created through this process and fix the pieces of the architecture that have the most band-aids on them.
- Look at your SLAs and determine areas of improvement. Are there spikes during deployments or single machine failures?
- Ensure that your software scales up and down automatically.
I have not covered follow-the-sun on-calls, where an on-call shift only happens during working hours and gets handed off to another region of the world. I have also not covered the decentralized model of having each development team carry primary pagers only for their piece of the world. I have not covered these topics because they share the pain between more people. I believe that a company can rationally make a decision to share the pain only once the on-call pain has been reduced as much as possible. So, I will leave the discussion of sharing the pain for another blog post.