I have a confession to make. I was going to use this blog to wax poetic about the flaw inherent in the well-established notion of the Three Pillars of Observability. I was going to argue that a fourth pillar was badly needed: topological metadata. In other words, without annotating each data stream across the three common data types with a pointer to where in the deployment topology that data is coming from, all you have is a lot of data and no chance of making sense of it. Clutching at metaphors, I was going to tell you that, clearly, a Three-Legged Stool is a dangerous, unstable mess.
The joke is on me, though. Maybe I should have paid more attention in school (or life in general, vs. watching The Witcher or counting out the time signatures of the songs on the new Tool album...). It turns out that a three-legged stool is actually pretty freaking stable, whereas a device with four legs can be rather wobbly. I will leave you with this gem of a webpage for the details. Like everything, once you understand the intuition, it’s pretty easy to grasp. So now that reality has entirely devastated my blog hook, I will not talk about the Four Pillars of Observability. Instead, I will simply share a bit more about how we internally look at the relationship between some of our favorite and frequently used terms: Observability, Monitoring, and Reliability.
Have a look at this:
Observability has given us an umbrella under which to understand the need to combine all the signals produced by the complex systems we are growing and feeding, independent of data type. It has also given us a framing to acknowledge that those systems are in many ways black boxes, and that signals formatted as logs, metrics, or traces at best allow us to approximate the actual state of the underlying processes. But hey, we gotta start with the data we have, and we can say somewhat confidently that measurement is a foundation of the scientific process and of our understanding of reality, whether created by the universe or ourselves. But in the end, I believe that Observability is merely a means to an end.
And Monitoring is a means to an end as well, I think, no matter how much it’s in the foreground for many of us. So what is the end game, then? I am not enough of a nihilist to believe there is no end to this. The end is to have reliable systems. But is it? More precisely, systems that are available and performing create a competitive customer experience. Simply put: the business need for reliability is the driver for everything we do. It is the ultimate end. This applies to all of you out there trying to tame the beasts you have created or inherited. It doubly applies to us at Sumo, as we have the excitingly recursive challenge of not only having to tame our own beasts, but our stable of beasts is also in your service in the shape of our product, delivered as a service to help you monitor for Reliability.
So from the top: to achieve Reliability, you have to establish a Monitoring practice. A lot goes into establishing a successful monitoring practice, but tooling is usually a part of it. We strive to be the best tool for the job, and your trust in our ability to deliver our service with the needed functionality, availability, and performance matters to us a great deal. What to do with the tooling, then? At the highest level, tooling helps you establish a way to know when something is wrong by means of alerting. There is a lot more to drill into on this topic, and we will be talking about it more this year. But in general, once you have an idea that something is wrong, you need to be able to troubleshoot. You need to be able to approximate an understanding of what has happened, or what is still happening. You need to be able to hypothesize and use data to validate ideas about those black boxes. The goal of troubleshooting is to make things right. That usually means the focus is on restoring service, because more than anything else, you need to be able to undo any effect on customer experience.
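To make the alerting step above concrete, here is a minimal sketch of a threshold-based alert check. The metric name, threshold value, and alert format are all hypothetical illustrations, not Sumo Logic's actual alerting API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricSample:
    """One observed data point from a monitored system."""
    name: str
    value: float

# Hypothetical SLO-style threshold: alert when p99 latency exceeds 500 ms.
LATENCY_THRESHOLD_MS = 500.0

def evaluate_alert(sample: MetricSample) -> Optional[str]:
    """Return an alert message if the sample breaches the threshold, else None."""
    if sample.name == "p99_latency_ms" and sample.value > LATENCY_THRESHOLD_MS:
        return (f"ALERT: {sample.name}={sample.value}ms "
                f"exceeds {LATENCY_THRESHOLD_MS}ms")
    return None

# One sample breaches the threshold, one does not.
samples = [MetricSample("p99_latency_ms", 620.0),
           MetricSample("p99_latency_ms", 310.0)]
alerts = [msg for s in samples if (msg := evaluate_alert(s))]
```

In practice the evaluation runs continuously against streaming data, and the alert is the entry point into the troubleshooting loop described above: a signal that it is time to start hypothesizing and validating.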
Of course, we recommend you go further than just restoring service and make an effort toward finding the root cause. Your tooling will continue to be extremely important in supporting this effort, because it aggregates the observable signals across the various data types, or pillars, just as it uses those signals to drive alerting in the first place. And to loop all the way back to the beginning, we believe it is useful (and obvious!) that the signals on their own are not very useful until you annotate them with topological information. Topology is a fancy term, but all it means is that there is usually a hierarchical, often largely virtual representation of the way systems are deployed. We commonly express it by tagging: adding metadata during the process of creating resources, or attaching metadata during the collection process. Our new Continuous Intelligence Solution for Kubernetes leverages all the information and models present in your Kubernetes deployments, for example. Knowing that a metric, trace, or log stream is coming from a particular environment, cluster, namespace, microservice, or instance, or belongs to a particular team, is the final piece of the puzzle.
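The tagging idea above can be sketched in a few lines. This is an illustrative data shape, not Sumo Logic's actual data model; the tag names simply mirror the Kubernetes-style hierarchy mentioned in the text:

```python
# A raw log record as it might arrive from an application.
record = {
    "timestamp": "2020-01-15T10:00:00Z",
    "message": "request failed with 503",
}

# Topological metadata attached at collection time. Without these tags,
# the record is just a line of text; with them, it has a place in the
# deployment hierarchy.
topology = {
    "environment": "production",
    "cluster": "us-east-1",
    "namespace": "checkout",
    "microservice": "payment-api",
    "team": "payments",
}

# Annotate the record by merging in the topology tags.
annotated = {**record, **topology}

def from_namespace(records, namespace):
    """Filter annotated records down to a single namespace."""
    return [r for r in records if r.get("namespace") == namespace]

# Now signals can be sliced by deployment context during troubleshooting.
matches = from_namespace([annotated], "checkout")
```

The same pattern applies to metrics and traces: once every signal carries the same topology tags, you can pivot across all three data types by environment, cluster, namespace, service, or team.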
Let us know what you think. In the meantime: Follow the evidence/Look it dead in the eye.