September 10, 2020 Dave Sudia

Configuring the OpenTelemetry Collector

What is OpenTelemetry?

The OpenTelemetry Collector is a new, vendor-agnostic agent that can receive and send metrics and traces in many formats. It is a powerful tool in a cloud-native observability stack, especially when your apps use multiple distributed tracing formats, like Zipkin and Jaeger, or when you want to send data to multiple backends, like an in-house solution and a vendor. This article walks you through configuring and deploying the OpenTelemetry Collector for such scenarios.

Why OpenTelemetry?

Over the last few years, there has been a proliferation of observability standards in the cloud-native space. On the tracing side, Jaeger and Zipkin have competed for mindshare, alongside the OpenTelemetry project that attempted to unify those standards. On the metrics side there are Prometheus, Graphite, StatsD, and more. In many organizations, several of these tools are deployed across different applications, or a third-party tool uses a different standard than the one the engineering org has settled on. These scenarios make observing a whole system a nightmare.

The OpenTelemetry project is building tools to wrangle all these different observability producers and receivers, and make them usable for organizations. The Collector specifically acts as universal middleware: it receives trace and metric data in various standards and forwards it to various backends in the format each backend expects.

OpenTelemetry vs OpenTracing

A (Real) Example of OpenTelemetry

At my company, we use the OpenTelemetry Collector to both centralize and distribute traces and metrics for several reasons.

  1. We use Ambassador as an API Gateway that receives all the inbound traffic to our Kubernetes cluster and forwards it to the correct application. Ambassador supports tracing using Zipkin. However, all of our applications use Jaeger as the tracing library. We also use Jaeger as the backend for analyzing our trace data. For a long time, we could not use the Ambassador trace data because of this, which hurt our ability to troubleshoot incoming traffic.
  2. We have databases managed by a vendor that output Prometheus metrics, but internally we use the Prometheus Operator, which does not have a friendly interface for monitoring resources outside the cluster it is in.
  3. We are in the early stages of sending our traces to a vendor, rather than managing them in-house. But for now we need to send them to the vendor and our Jaeger backend.

Without the OpenTelemetry Collector, several things are orphaned, namely:

  • Database metrics aren’t collectable
  • Ambassador traces aren’t collectable
  • We can’t send any traces to the vendor

OpenTelemetry Tutorial: Deploying the OpenTelemetry Collector

The easiest way to get started with the collector is to get the example deployment file from GitHub. It will create a Deployment for the Collector and a DaemonSet of agents that forward to the collector from each Kubernetes node, along with ConfigMaps that provide configuration for both. With the Kubernetes CLI installed, your cluster set up (which will vary by provider), and the file downloaded, run:

$ kubectl apply -f k8s.yaml

However, almost every user will need to provide a custom configuration. We’ll focus on configuring the collector itself, rather than the DaemonSet, which is an optional (but best-practice) part of the architecture. The collector is robust, and we have experienced no issues sending data from apps directly to the collector in a production setting.
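
Sending directly from applications assumes the collector is reachable at a stable address inside the cluster. A minimal sketch of a Service that could sit in front of the collector Deployment, using the receiver ports from the configuration later in this article. The Service name, namespace, and selector label are assumptions and must match whatever your Deployment actually uses:

```yaml
# Hypothetical Service for the collector Deployment.
apiVersion: v1
kind: Service
metadata:
  name: otel-collector          # assumed name
  namespace: otel               # assumed namespace
spec:
  selector:
    component: otel-collector   # assumed pod label on the collector pods
  ports:
    - name: jaeger-grpc         # Jaeger gRPC
      port: 14250
      protocol: TCP
    - name: jaeger-compact      # Jaeger Thrift compact (UDP)
      port: 6831
      protocol: UDP
    - name: jaeger-http         # Jaeger Thrift over HTTP
      port: 14268
      protocol: TCP
    - name: zipkin              # Zipkin spans, e.g. from Ambassador
      port: 9411
      protocol: TCP
```

With names like these, applications in the cluster would address the collector as otel-collector.otel on the relevant port.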

Configuring the OpenTelemetry Collector

Configuring the Collector for your needs can be tricky the first time, but once you are familiar with the settings, it gets easier. There are five things you need to configure:

  1. Receivers - ports and formats the collector can take in.
  2. Processors - ways to mutate data in the pipeline, like annotating or sampling.
  3. Exporters - endpoints you want to forward data to.
  4. Extensions - offer extra functionality like health checking the collector itself.
  5. Pipelines - tie receivers, processors, and exporters together into flexible groups so you can send specific data to specific places after processing it in different ways.
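
As a rough orientation, the five items above map onto the top-level keys of the collector configuration file. The sketch below shows only the empty skeleton; the pipeline name is a placeholder, and each section is filled in over the rest of this article:

```yaml
receivers: {}      # ports and formats the collector takes in
processors: {}     # mutations applied to data in a pipeline
exporters: {}      # endpoints that data is forwarded to
extensions: {}     # extras such as health checks

service:
  extensions: []   # names of the extensions to enable
  pipelines:
    traces:        # placeholder pipeline name
      receivers: []
      processors: []
      exporters: []
```

Defining a component under receivers, processors, or exporters does not activate it by itself; it only takes effect once a pipeline under service references it by name.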

Let’s look at a config for the example presented above:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-configmap
  namespace: otel
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  # The collector configuration from the sections below goes here, indented under this key.
  otel-collector-config.yaml: |
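
For the collector to read this ConfigMap, it has to be mounted into the collector pod and passed via the --config flag. A minimal sketch of the relevant pod spec fragment, assuming the contrib image (needed for the third-party processor used later) and a mount path of /conf; the container name, volume name, and path are illustrative, not taken from the example deployment file:

```yaml
# Fragment of the collector pod spec (not a complete Deployment).
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib   # assumed image; pin the version you deploy
    args:
      # Each ConfigMap key becomes a file under the mount path.
      - "--config=/conf/otel-collector-config.yaml"
    volumeMounts:
      - name: otel-collector-config-vol
        mountPath: /conf
volumes:
  - name: otel-collector-config-vol
    configMap:
      name: otel-collector-configmap              # the ConfigMap defined above
```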

Receivers

We have three receivers:

  1. Jaeger
    1. Our applications were emitting Jaeger traces in three different ways, all of which the collector can receive.
  2. Zipkin
    1. So we can receive traces from Ambassador.
  3. Prometheus
    1. This scrapes, just like a regular Prometheus instance does. In fact we give it a scrape config, just like we would give Prometheus. It will scrape metrics from these targets and send them through the pipeline.

receivers:
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_http:
        endpoint: 0.0.0.0:14268
  zipkin:
    endpoint: 0.0.0.0:9411
  prometheus:
    config:
      scrape_configs:
        - job_name: 'databases'
          scrape_interval: 5s
          static_configs:
            - targets:
                - database1dns:9091
                - database2dns:9091
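
On the sending side, each producer needs to be pointed at the matching receiver port. A sketch of what that might look like, assuming the collector is reachable at otel-collector.otel (a hypothetical Service name and namespace) and that Ambassador is configured through its TracingService resource; the Ambassador namespace is also an assumption:

```yaml
# Ambassador: emit Zipkin-format traces to the collector's zipkin receiver.
apiVersion: getambassador.io/v2
kind: TracingService
metadata:
  name: tracing
  namespace: ambassador                  # assumed namespace
spec:
  service: "otel-collector.otel:9411"    # assumed collector address
  driver: zipkin
---
# Fragment of an application container spec (not a complete manifest).
# Several Jaeger client libraries read JAEGER_ENDPOINT to send spans over
# Thrift HTTP; check that the client you use supports it.
env:
  - name: JAEGER_ENDPOINT
    value: "http://otel-collector.otel:14268/api/traces"
```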

The rest of the configuration follows the same pattern, so let’s take it one section at a time.

Processors

There are five processors:

  1. Memory Limiter
    1. From the docs: “The memory limiter processor is used to prevent out of memory situations on the collector.” It’s worth reading that whole document.
  2. Probabilistic Sampler
    1. Configurable to sample out traces by percentage to reduce total volume.
  3. Batch
    1. Forwards to exporters in a batch to reduce connection count.
  4. Kubernetes Tagger
    1. This is a third-party processor from the contrib repository. It adds Kubernetes metadata to traces for better analysis. The collector is highly extensible, and many companies and individuals are building out the ecosystem.
  5. Queued Retry
    1. Drops incoming metrics and traces into a queue so that if sending fails they can be retried rather than lost.

processors:
  memory_limiter:
    ballast_size_mib: 683
    check_interval: 5s
    limit_mib: 1336
    spike_limit_mib: 341
  queued_retry:
    num_workers: 16
    queue_size: 10000
    retry_on_failure: true
  batch:
    send_batch_size: 1024
    timeout: 5s
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 1
  k8s_tagger:
    passthrough: true
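
The memory limiter settings only make sense relative to the memory the container actually has. A sketch of the matching pod spec fragment, assuming a collector version from the same era as this article, where ballast_size_mib is expected to match the --mem-ballast-size-mib command-line option. The 2Gi limit is an assumption chosen to leave headroom above limit_mib, not a value from our deployment:

```yaml
# Fragment of the collector container spec (not a complete Deployment).
containers:
  - name: otel-collector
    args:
      - "--config=/conf/otel-collector-config.yaml"   # assumed config path
      - "--mem-ballast-size-mib=683"                  # should match ballast_size_mib
    resources:
      limits:
        memory: 2Gi                                   # assumed; keep this above limit_mib (1336)
```

If the container limit sits at or below limit_mib, Kubernetes is likely to kill the pod before the memory limiter has a chance to shed load.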

Extensions

We’ve activated two extensions:

  1. Health check
    1. Adds an endpoint for health checking the container, useful in Kubernetes for correct deployments.
  2. zPages
    1. This serves in-process diagnostic pages over HTTP on a separate port, for debugging the collector.

extensions:
  health_check: {}
  zpages: {}
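
The health check endpoint is what Kubernetes probes would point at. A minimal sketch of the probes on the collector container, assuming the extension is left on its default port of 13133 as it is in the config above:

```yaml
# Fragment of the collector container spec (not a complete Deployment).
livenessProbe:
  httpGet:
    path: /
    port: 13133   # health_check extension default port
readinessProbe:
  httpGet:
    path: /
    port: 13133
```

With a readiness probe in place, a rolling update only shifts traffic to a new collector pod once it reports healthy.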

Exporters

We’ve created four exporters:

  1. jaeger/1
    1. This forwards traces to the vendor we are working with, who is actually using...the OpenTelemetry Collector with a Jaeger receiver!
  2. jaeger/2
    1. This forwards traces to our in-house Jaeger collector instance.
  3. logging
    1. Logging from the collector is very minimal unless you turn on this exporter. It offers common levels like debug (for verbose information on the data) and info (providing a summary, like number of spans or metrics currently processed).
  4. prometheus
    1. This forwards to another Prometheus instance via remote_write.

exporters:
  jaeger/1:
    endpoint: "vendor-otelcol.vendor:14250"
  jaeger/2:
    endpoint: "jaeger-collector.jaeger:14250"
  logging:
    loglevel: info
  prometheus:
    endpoint: "prometheus:9090"
    namespace: prometheus-operator

Pipelines

Here is where the magic happens, and the flexibility of the collector shines. We have three export pipelines, one for sending traces to the vendor, one for sending them to our in-house Jaeger, and one for sending metrics to our in-house Prometheus. Note that in these pipelines the order of the processors does matter.

  1. traces/1
    1. This pipeline uses all our tracing receivers, tags, batches, and queues the traces, then sends them to our vendor’s OpenTelemetry Collector. We tag first, then batch, then queue the batched traces for sending. We also log the traces to help with debugging the process of getting them to the vendor.
  2. traces/2
    1. This is almost identical to the first pipeline, but we add the probabilistic sampler processor, so we only forward 1% of the traces on to our internal Jaeger instance. We’ve struggled with managing storage for Jaeger, which is one reason we’re now working with a vendor to manage traces.
  3. metrics/1
    1. This pipeline forwards the database metrics to our Prometheus instance to centralize all our metrics.

service:
  extensions:
    - health_check
    - zpages
  pipelines:
    traces/1:
      receivers:
        - jaeger
        - zipkin
      processors:
        - memory_limiter
        - k8s_tagger
        - batch
        - queued_retry
      # logging is an exporter, so it is listed here rather than under processors
      exporters:
        - jaeger/1
        - logging
    traces/2:
      receivers:
        - jaeger
        - zipkin
      processors:
        - memory_limiter
        - probabilistic_sampler
        - k8s_tagger
        - batch
        - queued_retry
      exporters:
        - jaeger/2
    metrics/1:
      receivers:
        - prometheus
      processors:
        - memory_limiter
        - batch
        - queued_retry
      exporters:
        - prometheus
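
The two trace pipelines are separate only because they process data differently: one samples and the other does not. If both backends were meant to receive identical data, a single pipeline listing both exporters would be enough. A sketch of that alternative, reusing the component names defined above (the pipeline name is arbitrary):

```yaml
service:
  pipelines:
    traces:
      receivers: [jaeger, zipkin]
      processors: [memory_limiter, k8s_tagger, batch, queued_retry]
      # Every exporter listed here receives the same processed data.
      exporters: [jaeger/1, jaeger/2]
```

Splitting into traces/1 and traces/2 is the way to give each backend its own processor chain while still sharing the same receivers.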

Other Uses of OpenTelemetry

The possibilities of the OpenTelemetry architecture in an observability pipeline are nearly endless. The configuration above is centralized, but the collector itself is lightweight. Teams could deploy their own collectors to avoid having to manage many hands working on one configuration.

The collector could just as easily run as a sidecar to an application or a database as it runs as the standalone service configured above. In that case, each collector would be configured to send its metrics or traces to a central Prometheus or Zipkin/Jaeger, or to split them between a team’s Prometheus and a central one, and so on.
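
A sidecar configuration can be much smaller than the central one. A minimal sketch, assuming the application emits Jaeger Thrift compact spans to its own pod and that a central collector is reachable at otel-collector.otel:14250 (a hypothetical address). It uses only component types already shown in this article, and TLS settings are omitted as they are in the exporters above:

```yaml
receivers:
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831              # the app in the same pod sends here

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  jaeger:
    endpoint: "otel-collector.otel:14250"   # assumed central collector address

service:
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [batch]
      exporters: [jaeger]
```

Sampling and Kubernetes tagging can then stay in the central collector, so each team's sidecar config remains a few lines long.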

The OpenTelemetry project is also currently exploring adding logging to its domain, which means that building flexible, extensible pipelines for the entire logs/metrics/traces observability trio could soon be this easy.

Dave Sudia

DevOps Engineer

Dave Sudia is an educator - turned developer - turned DevOps Engineer. He's passionate about supporting other developers in doing their best work by making sure they have the right tools and environments. In his day-to-day he's responsible for managing Kubernetes clusters, deploying databases, writing utility apps, and generally being a Swiss-Army knife. He can be found on Twitter @thedevelopnik
