I have been spending a considerable amount of time on distributed tracing topics recently. In my previous blog, I discussed the pros and cons of various approaches to collecting distributed tracing data. Now I would like to draw your attention to the analysis backend: what does it take to be good at analyzing transaction traces? As mentioned in that blog, one of the most important outcomes of adopting open source tracing standards is the freedom to choose the right analysis backend, as long as it supports these standards. So, what is the requirement list for a distributed tracing backend? What should it do, and what are the absolute must-haves? We have looked at many free, open source, and commercial offerings on the market and found a few tools that are good in one area or another, but nothing that fully matches the complete list. And yes, the Sumo platform is very complex, with high-end requirements, so your own list may not be as demanding. Anyway, here it goes, in no particular order:
Needs to talk distributed tracing concepts - traces, not requests
There’s a certain confusion that becomes quite clear when you try to work with existing distributed tracing/APM products on the market. On one hand, they all focus on what is (agreed) the most important KPI: the response time of a microservice. On the other hand, they often overlook that a distributed trace, especially in a microservice-based environment, is almost never a single-service job. You can find products that show spans and call them traces, and products that show you a trace only when you click on a span. All that mess, in my opinion, stems from the fact that there’s little understanding of how customers really want to use tracing data. It seems the industry needs to pay a bit more attention to what practitioners have to say in this area, rather than trying to squeeze new data formats into existing flat backend structures.
When I talk to customers and our internal SRE teams I see there are three main distinct use cases:
- investigation of a single transaction, represented by a trace,
- investigation of service relations, represented by a service map and
- health of a single microservice, represented by trace-based metrics computed from span-level data.
These three worlds need to interconnect with each other, but not be confused. Traces are traces and represent client transactions, spans are spans and represent single atomic calls to complete the former.
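To make the distinction concrete, here is a minimal sketch of how the same set of spans can feed all three views. The `Span` fields, the function names, and the sample data are assumptions made for illustration only; they are not the data model of any particular product or standard.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (illustration only).
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int


def group_traces(spans):
    """View 1: group spans into traces (one trace = one client transaction)."""
    traces = defaultdict(list)
    for s in spans:
        traces[s.trace_id].append(s)
    return dict(traces)


def trace_duration_ms(trace):
    """End-to-end duration of a transaction: earliest start to latest end."""
    return max(s.end_ms for s in trace) - min(s.start_ms for s in trace)


def service_edges(spans):
    """View 2: caller -> callee service pairs with call counts (a service map)."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get((s.trace_id, s.parent_id))
        if parent is not None and parent.service != s.service:
            edges[(parent.service, s.service)] += 1
    return dict(edges)


def mean_span_latency_ms(spans):
    """View 3: per-service mean span duration (a span-level health metric)."""
    totals = defaultdict(lambda: [0, 0])  # service -> [sum of durations, count]
    for s in spans:
        totals[s.service][0] += s.end_ms - s.start_ms
        totals[s.service][1] += 1
    return {svc: total / count for svc, (total, count) in totals.items()}


# Made-up sample data: two transactions crossing up to three services.
SPANS = [
    Span("t1", "a", None, "frontend", 0, 120),
    Span("t1", "b", "a", "checkout", 10, 100),
    Span("t1", "c", "b", "payments", 20, 80),
    Span("t2", "d", None, "frontend", 0, 40),
    Span("t2", "e", "d", "checkout", 5, 35),
]
```

The three functions read the same spans but answer different questions, which is why a backend that stores only a flat list of spans tends to blur them together.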
Needs to deal with tons of data
Tracing data on the wire is in fact a form of structured log, but it puts certain additional requirements on the backend. As discussed before, it is crucial to understand the tracing perspective of that data, and not merely treat it as a stream of requests/responses (spans). But if you imagine the volumes of such data, the complexity of span relations in a single trace, and the fact that there’s no information about when the trace actually ended, you start to understand the demands placed on a backend that has to deal with this volume of advanced data structures. For example, in the Sumo Log Search platform, we can have millions or tens of millions of transactions generated by users in a single minute. Many of these transactions will pass through more than 20 microservices and generate thousands of spans in a single trace. Some of these transactions (like a very complex search query) can take minutes to complete, as they often have to scan terabytes of raw data. Try to use any existing free, open source distributed tracing backend tools to cope with that load… best of luck!
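Because spans do not announce when their trace is complete, a backend has to decide that for itself. One common heuristic is an idle timeout: treat a trace as finished once no new span has arrived for it for some period. The sketch below illustrates that idea only; the `TraceAssembler` class, its method names, and the 30-second default are hypothetical, and it does not describe how the Sumo platform or any other product actually works.

```python
class TraceAssembler:
    """Buffers spans per trace and releases a trace after a quiet period."""

    def __init__(self, idle_timeout_s=30.0):
        self.idle_timeout_s = idle_timeout_s
        self._spans = {}      # trace_id -> list of buffered spans
        self._last_seen = {}  # trace_id -> arrival time of the newest span

    def add(self, span, now):
        """Buffer a span and record when its trace was last active."""
        trace_id = span["trace_id"]
        self._spans.setdefault(trace_id, []).append(span)
        self._last_seen[trace_id] = now

    def flush_idle(self, now):
        """Return {trace_id: spans} for traces idle for at least the timeout."""
        done = [t for t, seen in self._last_seen.items()
                if now - seen >= self.idle_timeout_s]
        finished = {}
        for t in done:
            finished[t] = self._spans.pop(t)
            del self._last_seen[t]
        return finished
```

The timeout is a trade-off: too short, and a transaction that runs for minutes gets split into fragments; too long, and every in-flight trace has to stay buffered, which is exactly where the volumes described above start to hurt.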
Needs to be secure
As already stated, tracing data represents user transactions, which often carry personal or sensitive information. Logins, purchases, and money transfers are all good examples of mission-critical transactions that, if they fail, result in degraded user satisfaction. It is important to monitor them and preserve all available information that can help troubleshoot bad behavior. This information can be sensitive. You don't want to send it to an observability platform that was not architected with security as a primary concern from the ground up. You want to ensure that the vendor taking your money to ingest your data is making all the necessary investments in securing its customers’ data. This is even more important when you realize that the more business-critical and money-bound a transaction is, the more important it is to monitor and track it. So you will find a lot of customers from the financial sector using distributed tracing tools, and a lot of requirements put on the tracing backend to have security baked in from the design phase.
Tracing is not an independent island
Distributed tracing is a very useful concept: traces can carry logs and create metrics. They can be very useful in answering particular types of questions, especially finding the faulty component when the only thing we know is a user-side complaint or alert. But by no means can they be treated as everything you need to make your application observable. Information about infrastructure, typically in the form of time-series metrics, is crucial to understanding whether the “gear behind” is healthy. Once you have found a faulty software component and isolated the problem down to a single microservice, you still need some background information about its state and visibility into its behavior. Things like recent deployments, the current version, and recent faults are typically present in the logs of the investigated software component. Tracing backends should allow quick (one mouse click) pivots to change the investigation scope, in context, so you can check the health of the infrastructure running a microservice and learn more about its state by inspecting its logs.
Needs to be open and flexible
In my previous blog, I mentioned how important it is for end customers to own their observability data, and the benefits this brings to the quality of analysis and the agility of building observable applications. I also mentioned that a key piece of that puzzle is a good analysis backend that is open, flexible, and supports industry standards. It is always tempting for a software vendor to try to lock the customer into proprietary technology that is hard to replace and talks only to itself. But we at Sumo believe that the market is moving toward openness, and that the flexibility to accept third-party data as a first-class citizen in the analytics platform will increasingly be treated as a must-have by many customers. We hear that more and more from our customers and agree with them that this is how the market should evolve.
Needs to have a rich analytics layer
I already stated that traces are similar to structured logs. They are in reality text organized in a defined structure that describes a transaction's journey through the application stack, along with all the required details gathered along the way. That structure and all these details create many opportunities for interesting ad-hoc analysis of this data for a whole range of use cases. It is not enough to have a distributed tracing/APM platform that just shows a predefined set of screens with tracing-based metrics, a list of traces, and single-trace views.
What if I want to slice the data and focus only on a specific set of traces?
What if I want to break that data down by my own custom tags?
What if I add my own custom metrics into the span data?
What about free text search among tracing data?
All of these questions were raised when we tried to match our internal use cases and requirements to capabilities of products available on the market. With mixed luck...
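As a rough illustration of the kind of ad-hoc analysis these questions ask for, here is a small sketch that slices traces by tags, breaks a metric down by a custom tag, and runs a free-text search. The span layout, tag names (`customer_tier`, `region`, `cart_items`), and function names are all made up for this example and do not correspond to any product's query language.

```python
from collections import defaultdict

# Made-up span records carrying custom tags and one custom metric (cart_items).
SPANS = [
    {"trace_id": "t1", "name": "POST /checkout", "duration_ms": 120,
     "tags": {"customer_tier": "gold", "region": "eu"}},
    {"trace_id": "t1", "name": "charge card", "duration_ms": 60,
     "tags": {"customer_tier": "gold", "cart_items": 3}},
    {"trace_id": "t2", "name": "POST /checkout", "duration_ms": 300,
     "tags": {"customer_tier": "free", "region": "us"}},
    {"trace_id": "t3", "name": "GET /search", "duration_ms": 45,
     "tags": {"customer_tier": "free", "region": "eu"}},
]


def traces_matching(spans, **tags):
    """IDs of traces containing at least one span that has all the given tags."""
    return {s["trace_id"] for s in spans
            if all(s["tags"].get(k) == v for k, v in tags.items())}


def breakdown(spans, tag, value_key="duration_ms"):
    """Mean of a numeric span field, grouped by the value of a custom tag."""
    groups = defaultdict(list)
    for s in spans:
        if tag in s["tags"]:
            groups[s["tags"][tag]].append(s[value_key])
    return {k: sum(v) / len(v) for k, v in groups.items()}


def text_search(spans, needle):
    """Spans whose name or tag values contain the text (case-insensitive)."""
    needle = needle.lower()
    return [s for s in spans
            if needle in s["name"].lower()
            or any(needle in str(v).lower() for v in s["tags"].values())]
```

For example, `traces_matching(SPANS, region="eu")` selects the traces to focus on, `breakdown(SPANS, "customer_tier")` compares mean span duration across tiers, and `text_search(SPANS, "checkout")` finds spans by free text. Doing this in memory is trivial; doing it interactively at the volumes described earlier is the hard part.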
We believe there’s a gap in the market. We have lots of specialized APM tools that talk their own language and are inflexible and closed. We also have a lot of open source tools that are insecure and not scalable. When researching this market, you will find a few vendors that come from areas of expertise very distant from distributed tracing, and that neither quite understand nor appreciate the use cases that can and should be solved with tracing data. Then we have tracing-only vendors that preach that distributed tracing is all you need.
What is missing is a good analysis backend for open-standard-compatible tracing data. A platform where tracing would be understood, but also treated as a first-class citizen together with other types of time-series and text-based observability signals. A platform that would be secure, scalable, and offer enough analytics capabilities to address the variety of today's investigation needs. If you have similar thoughts, let us know, and let’s talk!