May 4, 2020 Kevin Goldberg

How to Scale Prometheus Monitoring

After StatsD and Graphite weren’t able to meet their needs for metrics and monitoring, engineers at SoundCloud developed the open source event monitoring and alerting tool, Prometheus. Because it’s easy to deploy and get started with -- and on the surface seems free -- it’s become a popular part of many DevOps teams' observability stack.

As an environment scales, so does the complexity of the Prometheus deployment. Many teams inevitably put more pressure on Prometheus than it was designed to handle. In fact, the Prometheus documentation states that its local storage keeps data for a limited retention window and is not intended to be durable long-term storage. These expanded use cases and expectations stretch Prometheus and require careful consideration for scaling. Ultimately, Prometheus wasn’t designed to answer questions like these:

  • How can I store all of my data outside of the cluster so that it doesn't fill up the local volumes?
  • Can I aggregate all of my Prometheus data across metrics and multiple instances?
  • Can I visualize my many instances in a unified way?
  • And how can I achieve this with minimal overhead and management for my team?

While this scalability problem doesn’t arise when Prometheus is monitoring small or simple deployments, the lack of visibility and unified data adds an extra cost when attempting to use Prometheus as a monitoring source of truth for distributed applications.

Many DevOps teams recognize these difficulties and opt to augment their monitoring with a purpose-built solution. Sumo Logic simplifies the challenges of managing Prometheus at scale, including data aggregation, long-term data retention, and log and event correlation, in a unified service.

Simplified data aggregation

By default, each Prometheus server persists metrics to its own local storage, but that storage was not created for distributed metrics storage across multiple nodes. As a result, every server holds an isolated slice of the data.
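To make that isolation concrete, here is a minimal sketch of what stitching a global view together by hand could look like. It assumes each server exposes Prometheus’ standard HTTP API endpoint /api/v1/query and returns the usual instant-vector response shape. The server names, the URLs in the usage note, and the prometheus_source label are hypothetical, and error handling is deliberately thin.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def query_instance(base_url: str, promql: str, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Run an instant query against a single Prometheus server.

    Uses the standard /api/v1/query endpoint and returns the raw
    "result" list of an instant-vector response.
    """
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed on {base_url}: {payload.get('error')}")
    return payload["data"]["result"]


def merge_vectors(results_by_instance: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Flatten per-server instant vectors into one list.

    Each sample is tagged with the server it came from, using a
    hypothetical "prometheus_source" label, so that identical series
    from different servers remain distinguishable.
    """
    merged = []
    for source, samples in results_by_instance.items():
        for sample in samples:
            labels = dict(sample["metric"])
            labels["prometheus_source"] = source  # hypothetical label name
            timestamp, value = sample["value"]  # value arrives as a string
            merged.append(
                {"labels": labels, "timestamp": float(timestamp), "value": float(value)}
            )
    return merged
```

A caller would loop over its servers, for example {"us-east": "http://prom-us-east:9090", "eu-west": "http://prom-eu-west:9090"} (hypothetical hosts), call query_instance for each, and pass the collected results to merge_vectors. Every additional server adds another endpoint to query and another failure to handle, which is the overhead that centralized aggregation is meant to remove.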

Sumo Logic greatly simplifies the process of scaling out a Prometheus deployment. By seamlessly aggregating Prometheus metrics data, Sumo Logic eliminates data silos and allows for global views of the entire cluster.

Aggregate data enables:

  • Global visibility
  • Simplified querying and troubleshooting
  • Simplified capacity planning and Prometheus server management within a cluster

Long term data retention

By default, Prometheus keeps only 15 days of data in local storage (adjustable with the --storage.tsdb.retention.time flag), and that storage isn't designed for durable long-term retention. According to Prometheus’ docs, “Note that a limitation of the local storage is that it is not clustered or replicated. Thus, it is not arbitrarily scalable or durable in the face of disk or node outages and should be treated as you would any other kind of single node database. Using RAID for disk availability, snapshots for backups, capacity planning, etc, is recommended for improved durability.”
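For a rough sense of what longer retention costs on a single node, the Prometheus storage documentation offers a rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with samples averaging roughly 1 to 2 bytes each. The sketch below simply applies that formula. The ingestion rate of 100,000 samples per second is an assumed figure for illustration, not a measurement from any real deployment.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_GIB = 1024 ** 3


def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Apply the rule of thumb from the Prometheus storage docs:

    disk = retention_seconds * ingested_samples_per_second * bytes_per_sample

    bytes_per_sample defaults to 2.0, the upper end of the documented
    1-2 byte average, to keep the estimate conservative.
    """
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    rate = 100_000  # assumed ingestion rate, samples per second
    for days in (15, 90, 365):
        gib = estimate_disk_bytes(days, rate) / BYTES_PER_GIB
        print(f"{days:>3} days of retention: ~{gib:,.0f} GiB")
```

Because the formula is linear, moving from the 15-day default to a full year of retention multiplies the disk requirement by about 24, all on a node that the documentation says should be treated like any other single-node database.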

Sumo Logic takes care of long-term storage of Prometheus metrics, enabling:

  • capacity planning to monitor how your infrastructure needs evolve
  • chargebacks so you can account for usage and bill different teams or departments
  • analysis of usage trends
  • compliance with data retention regulations in verticals such as banking and insurance

Faster troubleshooting with logs and metrics

To effectively tie metrics, events, and logs together, the monitoring agent needs to collect and store the events. Prometheus on its own does not collect or store events or logs; it handles only metrics.

Visibility of one without the other provides you with incomplete data; you need both to troubleshoot application issues quickly and efficiently.

  • Unified views of logs and metrics: Sumo Logic enables users to view, filter, and report on logs and metrics in one dashboard, and to overlay log activity on metrics.
  • Analytics for troubleshooting: Sumo Logic enables advanced analytics of log data and metrics data for contextual troubleshooting and quicker root cause analysis of issues.

Augment Prometheus for a dramatically lower cost of ownership

Cost

Running Prometheus in a highly available and scalable way requires a significant investment and engineering talent. Once your environment gets to a certain size you’ll need to allot employees and systems dedicated to running Prometheus rather than innovating on your product. Only a small handful of companies can afford to put resources towards managing support systems instead of projects that contribute to their core business.

Complexity

Self-managed Prometheus deployments tend to run into significant challenges in medium and large environments. It is during business-critical moments, like troubleshooting significant issues, that metrics are the most important -- organizations can’t afford to not have them available.

Sumo Logic’s scalability has been proven by thousands of customers who rely on Sumo Logic for operational insight into their logs and metrics. The multi-tenant architecture can ingest and analyze petabytes of metrics, logs, and event data; the solution also scales on demand to support rapid and elastic growth.

Conclusion

Prometheus is a great solution for collecting performance metrics data. For analytics on production deployments, however, you need a level of reliability and scalability that Prometheus simply wasn’t built to deliver. Augmenting Prometheus with Sumo Logic will provide greater value in the long term while giving your teams better performance and observability throughout your stack.

Scale your Prometheus monitoring with Sumo Logic. Sign up for a free trial.

Kevin Goldberg

Kevin is the senior technical content manager at Sumo Logic. He has nearly a decade of experience at high-growth SaaS companies with a focus on IT software, having previously worked for AppDynamics and SolarWinds. He is interested in all things tech and sports, and you can follow him on Twitter @kevin_goldberg.
