May 10, 2022 Drew Horn

How Sumo SREs manage and monitor SLOs as Code with OpenSLO

At Nobl9’s annual SLOconf—the first conference dedicated to helping SREs quantify the reliability of their applications through service level objectives (SLOs)—Sumo Logic shared our contribution of slogen to the OpenSLO community, as well as our commitment to OpenSLO as an emerging standard for expressing SLOs as Code.

slogen is an open source, SLO-as-code CLI tool based on the OpenSLO specification. slogen interprets SLOs and alert strategies defined by the OpenSLO specification and automatically creates SLO dashboards and monitors in Sumo Logic. The tool also contains several extensions to OpenSLO based on feedback from customers and our in-house engineering teams.

SLOs help teams define the reliability of their systems and services so they can make smart decisions about how to build and run applications. Service level indicators (SLIs) are a component of SLOs and measure how a service is performing, typically expressed as a percentage.

“The emerging methodologies behind SLOs bring a level of simplification to the very daunting challenge of creating a great digital experience for the users of our systems,” said Christian Beedgen, CTO, Sumo Logic. “To me, OpenSLO acts as a substrate to encode many of the basics of this methodology for everybody to reuse. At Sumo, we are trying to do exactly that with the development of slogen.”

At Sumo Logic, we are actively using slogen to monitor production services and would like to share an example of how our engineering team uses this tool alongside our observability platform to define and monitor service reliability and, frankly, keep on-call pages to an absolute minimum.

Eating our own dogfood: Sumo’s SLOs

Having an observability product ourselves, we draw a hard line in the sand when it comes to dogfooding our platform and tools. Implementing, monitoring, and alerting on SLOs is no exception, and slogen, combined with Sumo’s analytics engine, plays a critical part in how we manage our SLOs as code.

One of the many critical workloads under scrutiny via SLIs/SLOs is the data pipeline that powers anomaly detection of metrics for our Root Cause Explorer. Root Cause Explorer is Sumo Logic’s automated root cause analysis capability. This service accelerates troubleshooting by detecting, contextualizing, and correlating anomalies, or events of interest, at the service, orchestrator, and infrastructure layers of a modern app.

Each bubble in the screenshot below represents an anomaly in an entity and its associated metric (e.g. CPU utilization). The y-axis position represents the percent drift of the metric from its expected value after factoring in periodicity. High-drift events of interest are more serious than lower-drift ones.

Example event of interest in Root Cause Explorer

As you can imagine, we want to surface these events of interest to customers as fast as possible since they represent anomalous conditions in the app stack. Drift calculations are an important prerequisite for creating events of interest. While there are several component SLOs in place tied to measuring the reliability of the overall data pipeline responsible for events of interest (e.g. feature engineering, noise reduction, application of hand-crafted rules to prevent false positives, etc.), one important metric is the latency introduced in calculating drift itself. As a result, we define an SLI for drift calculation latency and set an objective that 80% of drift calculation jobs should complete in under 4400 milliseconds.
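To make the objective concrete, here is a minimal Python sketch of how a latency SLI like this one can be computed and compared against its target. The 4400 ms threshold and 80% target come from the objective above; the function names and the sample latencies are hypothetical and are not part of slogen or Sumo Logic.

```python
from typing import Sequence

LATENCY_THRESHOLD_MS = 4400  # threshold taken from the objective above
SLO_TARGET = 0.80            # 80% of jobs should beat the threshold


def latency_sli(latencies_ms: Sequence[float],
                threshold_ms: float = LATENCY_THRESHOLD_MS) -> float:
    """Return the fraction of jobs that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no drift calculation jobs observed in the window")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_objective(sli: float, target: float = SLO_TARGET) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


# Hypothetical job latencies in milliseconds (illustration only).
sample = [1200, 3900, 4100, 4400, 5200, 800, 2500, 3100, 4399, 6100]
sli = latency_sli(sample)
print(f"SLI = {sli:.0%}, objective met: {meets_objective(sli)}")
```

In this made-up sample, 7 of 10 jobs finish in under 4400 ms, so the SLI is 70% and the 80% objective is missed. Note that a job taking exactly 4400 ms does not count as good, because the objective says "under".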

Enter OpenSLO and slogen! Digging into the examples directory, you’ll find our drift-calculation.yml file used to define our objectives and alert strategy for this measure.

Notice anything interesting? We’ve extended the OpenSLO spec! Line 15 includes the ability to specify a log query to compute an SLI, while lines 31 and 34 give you the ability to create multi-window, multi-burn-rate alerts. The team at Sumo Logic is actively working with the OpenSLO community to work these into the standard specification, but you can use them now with slogen.
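For readers new to burn-rate alerting, the idea can be sketched in a few lines of Python. This is an illustration of the general technique, not slogen's implementation: the function names, the window counts, and the threshold of 2 are hypothetical. The sketch follows the common convention of alerting only when both the short and the long window exceed the threshold; the monitors slogen generates may combine the windows differently.

```python
from typing import Tuple

Window = Tuple[int, int]  # (good events, total events) observed in a window


def burn_rate(good: int, total: int, target: float) -> float:
    """Rate of error budget consumption; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = 1 - good / total
    return error_rate / (1 - target)


def should_alert(short: Window, long: Window,
                 target: float, threshold: float) -> bool:
    """Fire only when both windows burn budget faster than the threshold."""
    return (burn_rate(*short, target) >= threshold
            and burn_rate(*long, target) >= threshold)


TARGET = 0.80    # SLO target from this post
THRESHOLD = 2.0  # hypothetical burn-rate threshold

# A brief spike: the short window is hot, but the long window is not.
print(should_alert((50, 100), (700, 1000), TARGET, THRESHOLD))
# A sustained problem: both windows exceed the threshold.
print(should_alert((50, 100), (500, 1000), TARGET, THRESHOLD))
```

Requiring the long window filters out brief spikes, while requiring the short window lets the alert clear quickly once the service recovers.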

With this SLO now defined as code, creating the related content in Sumo becomes devastatingly simple:

slogen path/to/slogen/samples/logs/drift-calculation.yaml --apply

Let’s take a quick look at some of the key content created in Sumo Logic by slogen. First up are the scheduled views:

After running Terraform, we see several scheduled views running log searches and storing the results. This pre-aggregated data in the scheduled views is made available to the monitors and dashboards to support high-performance dashboarding and real-time alerting. Next up are the monitors used to search, parse, compute, and alert on SLOs whenever the short or long window averages breach the alert threshold:

And finally, we have the dashboard specific to this particular SLO:

These visualizations help our team quickly intuit the overall reliability of this service. Below is an overview of some of the dashboards that have been created.

  • Availability: Daily, weekly, and monthly availability measurements against the SLO target.

  • SLO Breakdown: A breakdown of availability and error budget by dimensions important to Sumo Logic to quickly surface and prioritize reliability issues by geographical region and customer tier.

  • Hourly Burn Rate: The burn rate helps to surface specific times of the day when most failures happen.

  • Burn Rate Trend: A trend of today’s burn rate compared to the last seven days.

  • Budget Forecast: A forecast of the remaining error budget.
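The arithmetic behind the error budget and forecast panels can be sketched as follows. This is a simplified illustration, assuming an occurrence-based budget and a naive linear extrapolation; the function names and sample numbers are hypothetical, and the forecast on the actual dashboard may be computed differently.

```python
import math


def error_budget_remaining(good: int, total: int, target: float) -> float:
    """Fraction of the error budget left (1.0 = untouched, < 0 = overspent)."""
    allowed_bad = (1 - target) * total  # failures the SLO tolerates
    if allowed_bad == 0:
        raise ValueError("no budget: zero events or a 100% target")
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def days_until_exhausted(remaining: float, elapsed_days: float) -> float:
    """Linear forecast: days left if the budget keeps burning at the same pace."""
    consumed = 1 - remaining
    if consumed <= 0:
        return math.inf  # nothing burned so far, so no exhaustion is projected
    daily_burn = consumed / elapsed_days
    return max(remaining, 0.0) / daily_burn


# Hypothetical numbers: 10,000 jobs so far this period, 8,500 of them good.
remaining = error_budget_remaining(good=8500, total=10000, target=0.80)
print(f"budget remaining: {remaining:.0%}")
print(f"days until exhausted: {days_until_exhausted(remaining, 15):.1f}")
```

With an 80% target, 10,000 jobs allow 2,000 failures; 1,500 failures in 15 days leaves about a quarter of the budget, which at the same pace would last roughly five more days.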

And that’s how we quantify reliability using SLOs for this service! Feel free to try it out yourself on Sumo with a free account, fork the project to customize, add your platform as a new target, swap out Terraform with Pulumi, propose more extensions to the OpenSLO specification, and hack away! Our training team has also created a video on how to use slogen and create SLOs as code:

We’re thrilled to be a part of the growing OpenSLO community and can’t wait to see where this project goes!

Drew Horn

Director, Business Development, ISVs

As a Director of Business Development, Drew is responsible for providing leadership and evangelism for the App Intelligence Partner Program, helping independent software vendors successfully evaluate and integrate the Sumo Logic platform with their solutions.

Drew has over 15 years of experience in IT ranging from early stage startups to Fortune 500 enterprises across engineering, quality assurance, DevOps, customer success, solutions engineering and professional services.

Recently, Drew was the Senior Director of Automation at Applause (a Vista Equity Partners portfolio company) where he spearheaded the GTM strategy, customer success and professional services for their test automation offering. Prior to joining Applause, Drew led the DevOps team at Amherst InsightLabs, facilitating the delivery and operation of data analytics platforms used to power Amherst's broker dealer, asset management and single family buyer/renter platforms. Drew started his career in InfoSec, helping enterprise network security software development teams build, test and deliver high quality products. He holds a B.S. in Mathematics from the University of Texas, Austin.
