When many of our customers discover real-time alerts, they’re usually so excited to have that kind of visibility into their systems that their first reaction is to set up alerts for whatever they can think of.
Tracking everything from critical application errors to shopping cart abandonment events might seem like a great idea, until you find an endless stream of alerts bombarding your inbox. Indiscriminate alerting makes it nearly impossible to identify important events, let alone actually use them to fix your system. In the worst cases, we’ve even seen users create filters to automatically send alerts to their trash bin.
If you find yourself acting more as an “alert manager” than as a developer or IT administrator, this post is for you. We’ll take a look at two guiding principles for designing meaningful alerts, as well as some hands-on examples in Sumo Logic.
What Makes an Alert Meaningful?
How do you know when your alerts aren’t effective? The simple answer is that if you’re ignoring them, it’s time to go back and reassess your alerting policy. Each of your alerts should have two key traits: it should be actionable, and it should be directed.
Alerts Should Be Actionable
Defining an alert answers the question, “What metrics do I care about?” For example, you might want to trigger an alert after a certain number of failed logins from a single IP address. But simply answering “what” isn’t enough for an actionable alert.
You also need to ask questions like “Why do I care about these metrics?” and “How do I respond when they hit a critical level?” For a failed login alert, you’ll probably want to run other queries like checking the location of the client IP to determine if it’s a malicious user. If it is, you would then block that IP address.
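To make the “what” half of the failed login alert concrete, here is a minimal Python sketch of the condition itself: count failed logins per client IP and report the IPs that reach a threshold. It is only an illustration, not how Sumo Logic evaluates alerts; the function name, the threshold of 5, and the sample IP addresses are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical threshold: failed logins from one IP that should trigger the alert.
FAILED_LOGIN_THRESHOLD = 5


def ips_over_threshold(failed_login_ips, threshold=FAILED_LOGIN_THRESHOLD):
    """Return the IPs with at least `threshold` failed logins, sorted."""
    counts = Counter(failed_login_ips)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Sample data (documentation-range IPs): one noisy client, one ordinary one.
events = ["203.0.113.7"] * 6 + ["198.51.100.4"] * 2
print(ips_over_threshold(events))  # ['203.0.113.7']
```

The output is only the trigger. The “why” and “how do I respond” questions above still need answers before the alert is actionable.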
Alerts aren’t just for monitoring your system—they’re for telling you that you need to do something. Every alert should have an associated playbook that defines the steps to take when you receive the alert. If you can’t define this playbook, the alert probably isn’t as important as you initially thought.
Alerts Should Be Directed
Once you’ve determined your alerts are actionable, you need to make sure someone is around to perform all those actions. You don’t want to let first responders ignore alerts, and you really don’t want your alerts sent to somebody’s spam folder.
Each alert should be directed to an individual who is accountable for handling it. It doesn’t matter whether this is a developer, an application manager, an IT administrator, or the CTO. The point is, an alert needs a clear owner so that they can assess the situation, triage as best they can, attempt to identify the root cause, and escalate if necessary.
As with playbooks, if you can’t direct an alert to a specific individual, odds are you don’t really need it.
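One way to apply both principles is to refuse to create an alert until it has a playbook and an owner. The Python sketch below shows that check under some assumptions: the `AlertDefinition` class, its field names, and the example owner address are hypothetical and do not correspond to any Sumo Logic API.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDefinition:
    name: str
    owner: str = ""                               # individual accountable for the alert
    playbook: list = field(default_factory=list)  # ordered response steps


def problems(alert):
    """List the reasons an alert fails the two principles (empty if none)."""
    issues = []
    if not alert.playbook:
        issues.append("not actionable: no playbook")
    if not alert.owner.strip():
        issues.append("not directed: no owner")
    return issues


failed_logins = AlertDefinition(
    name="failed-logins-per-ip",
    owner="alice@example.com",
    playbook=["Look up the location of the client IP", "Block the IP if it is malicious"],
)
cart_abandonment = AlertDefinition(name="cart-abandonment")

print(problems(failed_logins))     # []
print(problems(cart_abandonment))  # ['not actionable: no playbook', 'not directed: no owner']
```

An alert that cannot pass a check like this is a good candidate for a dashboard or a report instead of a notification.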
A Web Application Case Study
An alert that is both actionable and directed is a very powerful tool. It lets Sumo Logic monitor your system for you, filtering out all the noise, and only letting you know when a human needs to intervene. However, tuning alert parameters to make sure they’re actionable can be tricky.
A big part of crafting actionable alerts is making them dynamic. Instead of relying on static thresholds, dynamic alerts react to changing environments, which helps eliminate false positives. The rest of this article takes a look at two real-world examples of dynamic alerting.
Static 404 Error Monitoring
Before we get into dynamic alerts, let’s take a look at why static alerts can prove troublesome. Monitoring status code errors is a common use case for real-time alerts. For example, a spike in 404 errors probably means you included some broken links in your most recent code push.
| parse "HTTP/1.1\" * " as status_code
| where status_code = 404
A naive implementation would simply set a static threshold for 404 errors. For instance, you might use the above query to have Sumo Logic trigger an alert whenever fifty 404 errors occur in any 20-minute interval. However, if your traffic is cyclical or volatile, this could result in a lot of false positives. And the problem with false positives is that they’re not actionable.
In reality, you don’t care about any absolute number of 404 errors. You actually want to know when you have an “abnormal” number of 404 errors. With dynamic alerts, there are all sorts of ways to define “abnormal.”
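To see why the static rule misfires, here is a small Python sketch of it: fire whenever fifty or more 404s fall inside a trailing 20-minute window. This is an illustration of the rule as described above, not Sumo Logic’s alert engine; the function name and the synthetic timestamps (in seconds) are assumptions.

```python
from collections import deque


def static_alert_times(timestamps, limit=50, window_seconds=20 * 60):
    """Return the timestamps at which `limit` or more events sit in the trailing window."""
    recent = deque()
    fired = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that have aged out of the trailing window.
        while recent[0] <= ts - window_seconds:
            recent.popleft()
        if len(recent) >= limit:
            fired.append(ts)
    return fired


quiet = list(range(0, 3600, 120))  # one 404 every 2 minutes for an hour
busy = list(range(0, 3600, 20))    # one 404 every 20 seconds for an hour

print(static_alert_times(quiet))    # []
print(static_alert_times(busy)[0])  # 980
```

If the busy hour simply has six times the visitors, the share of requests hitting a 404 is unchanged, yet the static rule starts firing about 16 minutes in and keeps firing. Nothing is broken, so there is nothing to act on.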
Dynamic 404 Error Monitoring with the Outlier Operator
Sumo Logic’s outlier operator tracks the moving average of a value and detects when new values lie outside some multiple of the standard deviation. This lets you monitor rates of change and volatility.
| parse "HTTP/1.1\" * " as status_code
| where status_code = 404
| timeslice 1m
| count by _timeslice
| outlier _count
We can make our alert more dynamic by detecting abnormally high increases in 404 errors. The above query eliminates some of the false positives by flagging unusual increases in 404s relative to recent history instead of comparing against an absolute threshold. The outlier operator’s parameters can be tuned to allow a higher or lower degree of change in the quantity of 404s over time.
The idea is to reduce the amount of noise by automatically adapting to cyclical changes and natural growth (or decreases) in your system volume. Of course, this isn’t a silver bullet. A rapid influx of traffic will often be associated with a rapid increase in 404s, so this query can still result in false positives. For even better results, we need to consider the rest of our web traffic when analyzing 404 errors.
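The moving-average idea behind outlier detection can be sketched in a few lines of Python. This is a simplified illustration and not Sumo Logic’s actual implementation: it flags a value when it exceeds the mean of the previous `window` values by more than `threshold` standard deviations. The function name, the defaults, and the sample counts are assumptions.

```python
from statistics import mean, pstdev


def outliers(values, window=10, threshold=3.0):
    """Return indices whose value exceeds the trailing mean by more than
    `threshold` standard deviations of the trailing window."""
    flagged = []
    for i in range(window, len(values)):
        trailing = values[i - window:i]
        limit = mean(trailing) + threshold * pstdev(trailing)
        if values[i] > limit:
            flagged.append(i)
    return flagged


# 404s per minute: slow natural growth, then a sudden jump in the last minute.
counts = [10, 11, 10, 12, 11, 12, 13, 12, 13, 14, 14, 15, 14, 60]
print(outliers(counts))  # [13]
```

The gradual climb from 10 to 15 never trips the check because the baseline moves with it; only the jump to 60 is flagged. A perfectly flat window has a standard deviation of zero, so in this sketch any increase after a flat stretch is flagged, which is one reason real parameters need tuning.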
Even More Intelligent Alerting
404 errors are typically correlated with your total web traffic. When you have more visitors, you’ll often have more 404 errors, too. You can incorporate this relationship into an alert by comparing 200 status codes with 404 status codes over time:
_sourceCategory=Apache/Access (status_code=200 or status_code=404)
| timeslice 1m
| if (status_code=200, 1, 0) as sc_200
| if (status_code=404, 1, 0) as sc_404
| sum(sc_200) as sc_200count, sum(sc_404) as sc_404count by _timeslice
| sc_404count/sc_200count as sc_ratio
| sort _timeslice desc
| outlier sc_ratio window=10, threshold=2, consecutive=1, direction=+
This query calculates the ratio of 404 status codes to 200 status codes. As long as your 404 errors are increasing at a rate similar to your total traffic, this ratio stays the same, and you don’t have a problem. But when your 404 errors spike without a corresponding increase in 200 status codes, sc_ratio rises, and that is cause for concern. By detecting this change with outlier, you can create a dynamic alert that fires only when errors grow out of proportion to traffic, which usually means you’ve broken your code.
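A quick Python sketch with made-up numbers shows why the ratio is a better signal than the raw count. The function name and the per-minute figures are assumptions, and the handling of a timeslice with no 200 responses is a convention chosen for this sketch, not something the query above defines.

```python
def ratio_404(count_200, count_404):
    """Ratio of 404 responses to 200 responses for one timeslice."""
    if count_200 == 0:
        # Assumed convention: no successful requests at all is treated as worst case.
        return float("inf") if count_404 else 0.0
    return count_404 / count_200


normal = ratio_404(1000, 20)   # ordinary minute
busy = ratio_404(5000, 100)    # five times the traffic, five times the 404s
broken = ratio_404(1000, 100)  # ordinary traffic, five times the 404s

print(normal, busy, broken)  # 0.02 0.02 0.1
```

The busy minute has as many 404s as the broken one, but its ratio matches the ordinary minute, so an outlier check on the ratio stays quiet. Only the broken minute moves the ratio.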
Of course, you’ll probably want to take some time to tune the outlier parameters to match your own website traffic patterns, but incorporating correlations with other traffic metrics like this is a huge leap forward in making alerts more meaningful.
Your Ideal Alerting Scenario
This article was about finding the right balance between visibility and utility. If your alert constraints are too strict, they might not catch important events in your system. If they’re too lax, people start ignoring alerts, and all that time and energy you put into setting them up are for naught.
Your ideal alerting scenario depends largely on your product and how many man-hours you can devote to managing alerts. But a good benchmark is to make sure every alert has an associated playbook of actions and is directed to an individual who is able to perform those actions. These two simple principles will help ensure you’re getting the most out of your alerts.