Platform architects, SREs, developers, and DevOps staff who run mission-critical modern apps know that shaving 15 minutes off each of four 20-minute service incidents a year is the difference between missing a 99.99% availability objective and meeting it. After a successful worldwide preview, Sumo Logic Observability is now ready for site reliability engineers, DevOps staff, developers, and platform engineers to resolve incidents faster, maximize availability, and optimize their cloud infrastructure, microservices, and application operations for reliability objectives.
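The arithmetic behind that claim is easy to check: a 99.99% objective allows roughly 52.6 minutes of downtime per year, which four 20-minute incidents blow through but four 5-minute incidents do not. A quick sketch:

```python
# Downtime budget implied by a 99.99% ("four nines") availability objective.
MINUTES_PER_YEAR = 365 * 24 * 60           # 525,600 minutes
budget = MINUTES_PER_YEAR * (1 - 0.9999)   # ~52.6 minutes of allowed downtime

incidents_per_year = 4
downtime_before = incidents_per_year * 20  # 20-minute incidents -> 80 min, over budget
downtime_after = incidents_per_year * 5    # 15 minutes shaved off -> 20 min, within budget

print(round(budget, 1), downtime_before, downtime_after)
```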
Reliability as an outcome is not new. Modern applications, however, rely on diverse combinations of cloud infrastructure, container and orchestration tools, and other technologies, and these increasingly distributed environments are inherently hard to troubleshoot. Dependencies between layers in an application stack, such as between microservices and their underlying cloud resources, make troubleshooting cumbersome. Modern applications also emit huge volumes of logs, metrics, traces, and metadata: it is not unusual for an application to generate hundreds of gigabytes of logs per day, tens of thousands of time series per minute, millions of traces per day, and metadata from hundreds of app and infrastructure entities. Even mature development teams like Sumo Logic's face novel problems more often than not -- because, simply put, unknown behaviors and failures are an inherent property of distributed systems. Furthermore, all data used for troubleshooting needs to be protected and secured.
In what follows, we describe the latest Sumo Logic Observability innovations, starting with an example of troubleshooting an incident in a modern application. Consider the highly simplified mobile banking application shown below. In this example, the app is built on Kubernetes and AWS infrastructure, but the same process applies to other technology stacks and application architectures. Consumers trigger bill-payment transactions that flow through the AWS Application Load Balancer (ALB) to the payment-service, which is orchestrated by Kubernetes. The payment-service posts transactions to the accounts-service (another Kubernetes service), which stores them in an RDS database.
An elevated error rate for the payment-service would be the first sign of trouble, triggering an alert to an on-call engineer. The engineer would then have to hypothesize and test several scenarios that might be causing the elevated errors:
- A problem with the payment-service itself or the accounts-service it depends on
- A problem with the Kubernetes deployment or pods
- A problem with cloud nodes (e.g. AWS EC2 instances) used by the pods
- A problem with any other cloud infrastructure or platform service (e.g. AWS RDS, AWS ELB)
Suppose the on-call engineer determines that excessive connections to the RDS instance overloaded the database, increasing accounts-service response times, which in turn caused payment-service errors. While the immediate resolution might involve provisioning additional or larger RDS instances, deeper troubleshooting is required to determine why the RDS instance got into such a state in the first place. The latter may be caused by poorly written queries, underlying AWS issues, software flaws (e.g. connections that were left open), or bad architecture (e.g. single points of failure).
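The "connections that were left open" flaw is worth illustrating. The sketch below uses Python's built-in sqlite3 module as a stand-in for RDS; the function names and the `accounts` schema are hypothetical, invented for this example. The fix is simply to guarantee that every connection is closed, even when a query raises:

```python
import sqlite3
from contextlib import closing

# Flawed pattern: each request opens a connection and never closes it,
# so connections accumulate until the database hits its connection limit.
def leaky_lookup(db_path, account_id):
    conn = sqlite3.connect(db_path)  # never closed
    return conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()

# Fixed pattern: closing() guarantees conn.close() runs on every code path,
# including when the query raises an exception.
def safe_lookup(db_path, account_id):
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
```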
Sumo Logic Observability generalizes the workflow implied by this example into the three broad stages of delivering reliability outcomes highlighted in the figure below:
- Monitor critical service-level indicators, such as error rates, error budgets, and latency, against their objectives.
- Diagnose the service-level objective violation by narrowing the issue down to a specific application service or infrastructure component.
- Troubleshoot the underlying service or infrastructure component to uncover the root cause, restore the service, and ultimately eliminate the root cause to avoid future failures.
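To make the monitoring stage concrete, an error-budget check for a service-level indicator might look like the sketch below. This is illustrative only; the 99.9% SLO target and the request counts are invented for the example:

```python
# Illustrative error-budget check for the "monitor" stage (hypothetical numbers).
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left; negative means the SLO is violated."""
    allowed_failures = total_requests * (1 - slo_target)
    return (allowed_failures - failed_requests) / allowed_failures

# payment-service: 99.9% success SLO over 1,000,000 requests,
# so 1,000 failures are allowed; 250 have occurred.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget left")  # 75% of the error budget left
```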
Of course, none of this works without collecting logs, metrics, traces, and metadata at the application, microservice, cloud, orchestrator, and container layers. By themselves, these datasets are merely silos. To accelerate troubleshooting, as shown in the example, the user should be able to connect the dots between logs, metrics, and traces by pivoting on entities (services or resources): from the initial alert (an error log, in the example), to a microservice transaction trace (e.g. for the payment-service or accounts-service), to a metric for a Kubernetes pod, deployment, or AWS resource.
Sumo Logic Observability’s entity-driven workflow is at the core of its capabilities for monitoring, diagnosing, and troubleshooting modern apps, as described below.
Sumo Logic Observability combines log, metric, and trace datasets in a single platform and leverages an entity model that lets users correlate signals across logs, metrics, and traces as they go from an alert to the root cause. These entities are discovered automatically from the metadata in the logs, metrics, and traces generated by the application and its infrastructure.
For monitoring, Sumo Logic Observability now includes:
- Unified Alerting across log and metric data sources, with the ability to specify alert criticality, configure detection rules, set up multiple notification channels, auto-resolve incidents, and triage, administer, and manage alerts from a central landing page.
- AWS Observability, featuring 40+ dashboards to monitor AWS infrastructure comprehensively and intuitively across AWS accounts, regions, and resource types, down to individual entities.
For diagnosing incidents, Sumo Logic Observability now includes:
- Transaction Tracing to observe apps and microservices down to individual requests and pinpoint issues with particular microservices. Our OpenTelemetry-based tracing provides an open, flexible standard for observing microservice transactions without vendor lock-in.
- Revamped Metrics Explorer that reduces the complexity of finding and visualizing your metrics data with a new structured query builder and an extended range of visualizations for ad-hoc analysis. Mirroring the Dashboard (New) workflow, you now have the same unified experience in the main metrics tab.
- Global Intelligence for AWS CloudTrail DevOps, which helps on-call staff identify impacts caused by AWS errors (e.g. availability, throttling, insufficient capacity) as probable causes of their incidents.
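To make the tracing piece above concrete: OpenTelemetry propagates trace context between services using the W3C Trace Context `traceparent` header, which is what lets spans emitted by the payment-service and accounts-service stitch into one end-to-end trace. Below is a minimal, illustrative sketch of building and parsing that header in plain Python; it is not Sumo Logic's or OpenTelemetry's implementation, just the header format:

```python
import secrets

def make_traceparent():
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by the whole trace
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": flags == "01"}

# A downstream service reuses the trace_id from the incoming header, so
# payment-service and accounts-service spans join into one trace.
header = make_traceparent()
ctx = parse_traceparent(header)
```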
For troubleshooting incidents, Sumo Logic Observability now includes the following advanced analytics innovations:
- Root Cause Explorer, an AIOps breakthrough that helps on-call staff accelerate troubleshooting and root-cause isolation for incidents in apps and microservices running on AWS by detecting anomalies in 500+ AWS CloudWatch metrics and automatically categorizing them by incident timeline, AWS account, region, namespace, entity, AWS tag, and other dimensions.
- Behavior Insights, which leverages machine learning to detect patterns, outliers, and changes in underlying service behavior to isolate and automatically explain the root causes of application issues.
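Sumo Logic's anomaly detection is, of course, far more sophisticated than anything that fits in a blog post, but the core idea of flagging metric values that deviate sharply from a baseline can be sketched with a simple z-score check. The data below is invented to mimic the RDS connection spike from the earlier example:

```python
from statistics import mean, stdev

def zscore_anomalies(series, threshold=3.0):
    """Return indices of points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(series), stdev(series)
    return [i for i, x in enumerate(series)
            if sigma > 0 and abs(x - mu) / sigma > threshold]

# Steady DB connection counts with one spike, like the RDS overload above.
connections = [40, 42, 41, 39, 43, 40, 41, 180, 42, 40]
print(zscore_anomalies(connections, threshold=2.0))  # [7] -- the spike
```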
Underpinning these capabilities is expanded support for open-source frameworks, including OpenTelemetry for tracing data and Telegraf to broaden the set of technologies we collect metrics from. Our existing Redis and NGINX apps are now enhanced to leverage both logs and metrics. We have also added new apps for JMX and the NGINX Ingress Controller, a common component in Kubernetes stacks.
To support observability outcomes without breaking budgets, Sumo Logic Observability now lets customers tier data based on analytics requirements and offers an industry-first credits-based licensing model for ultimate flexibility, with cardinality-independent pricing for ephemeral resources in container environments. The Sumo Logic platform is encrypted end to end, has a 24x7 security operations center, and is certified and attested for PCI DSS, HIPAA, AICPA SOC 2, ISO 27001, GDPR, and FedRAMP (in progress).
In subsequent blog posts, we will delve into additional details for each of these capabilities.