OpenTelemetry in Kubernetes: From Installed to Actually Working
The gap between "OTel installed" and "OTel working" is where most teams get stuck.
The official documentation is strong for first contact: start the SDK, point it at a Collector, and see spans in Jaeger. What the documentation does not prepare you for is a real Kubernetes cluster with a dozen services, a mix of Python and Go, a trace that silently stops at the boundary between two microservices, and a finance team asking why trace storage costs tripled after rollout.
This guide targets DevOps and SRE engineers who already understand the concepts but have not landed a production OTel implementation that the team actually trusts. The goal is specific: a minimally reliable pipeline for Kubernetes distributed tracing, with correct context propagation and tail-based sampling that keeps storage costs predictable. Running OpenTelemetry in Kubernetes reliably requires decisions that hello-world examples do not force you to make — this guide surfaces them before they become incidents.
One framing that helped us: distributed traces are to microservices what session traces are to AI agents, namely a way to reconstruct what happened across a sequence of operations that no single participant can see in full. The context propagation mechanics map directly onto the pattern explored in our earlier piece on session traces.
What production OTel needs beyond hello-world
Hello-world OTel has three components: SDK in the application, Collector running somewhere, and backend receiving the data. This works for demos. Four requirements change the architecture when you move to production.
Collector as a control point. Applications should not talk directly to the backend, and they should not carry the cost of retries, batching, or data transformation. The Collector is where you gain control over the telemetry pipeline (sampling decisions, attribute enrichment, redaction, fan-out to multiple backends) without touching application code.
Kubernetes metadata enrichment. Your pods do not know which node they run on, which deployment version they belong to, or what team label is on their namespace. The Collector does, through the k8sattributes processor. Enriching spans with this context (pod name, deployment name, namespace, and team) is what makes "high latency in service X" turn into "high latency in service X on node Y after the 14:30 rollout of version Z."
Propagation consistency. A trace that crosses a service boundary works only if both services speak the same propagation format. Mixing W3C TraceContext with B3 or Jaeger's format silently breaks traces. This needs to be decided and enforced before the first service goes to production, not discovered six months later when traces look short.
Sampling and cost control. Full trace data at production load can quickly become a storage and retention problem. The answer is not to disable tracing — it is tail-based sampling at a centralized point, which lets you keep 100% of errors and slow traces while sampling healthy fast traces at a low rate. This requires architecture decisions that hello-world does not need.
Reference architecture: agent + gateway + backend
Production OTel on Kubernetes runs two tiers of Collectors.
```
┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                           │
│                                                              │
│  ┌───────────┐  OTLP/gRPC   ┌────────────────────────┐       │
│  │ Service A │─────────────▶│ OTel Agent             │       │
│  └───────────┘              │ (DaemonSet)            │       │
│                             │                        │       │
│  ┌───────────┐  OTLP/gRPC   │ Per-node collection    │       │
│  │ Service B │─────────────▶│ k8s metadata attach    │       │
│  └───────────┘              │ Batch + forward        │       │
│                             └───────────┬────────────┘       │
│                                         │ OTLP/gRPC          │
│                             ┌───────────▼────────────┐       │
│                             │ OTel Gateway           │       │
│                             │ (StatefulSet, 2+ pods) │       │
│                             │                        │       │
│                             │ Tail-based sampling    │       │
│                             │ Attribute redaction    │       │
│                             │ Fan-out to backends    │       │
│                             └───────────┬────────────┘       │
└─────────────────────────────────────────┼────────────────────┘
                                          │
                     ┌────────────────────┼────────────────────┐
              ┌──────▼──────┐     ┌───────▼───────┐     ┌──────▼───────┐
              │ Tempo /     │     │ Prometheus    │     │ Loki /       │
              │ Jaeger      │     │ / Mimir       │     │ Elasticsearch│
              └─────────────┘     └───────────────┘     └──────────────┘
```

Agent tier (DaemonSet). One Collector pod per node. Applications send telemetry to the agent over OTLP gRPC. The agent adds Kubernetes metadata through the k8sattributes processor, then batches and forwards to the gateway. Resource consumption is node-bounded: the agent only handles traffic from pods on its node.
Note on DaemonSet and localhost: an application pod does not reach a DaemonSet agent via localhost, because each pod has its own network namespace. Applications should target the node's IP instead, which requires the agent to be exposed through hostPort (or hostNetwork), or reached through a node-local Service. The simplest production pattern is to inject the node IP as an environment variable through the Downward API (status.hostIP) and reference it in the exporter endpoint: OTEL_EXPORTER_OTLP_ENDPOINT=http://$(NODE_IP):4317.
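For services that build their exporter configuration in code, the node-IP pattern can be sketched roughly as follows. This is a minimal illustration, assuming the pod spec injects a NODE_IP variable from status.hostIP and that the agent listens on port 4317; the helper name resolve_otlp_endpoint is hypothetical and not part of any SDK.

```python
import os


def resolve_otlp_endpoint(env=None, port=4317):
    """Return the OTLP endpoint of the node-local agent.

    Assumes NODE_IP is injected from status.hostIP via the Downward API.
    An explicit OTEL_EXPORTER_OTLP_ENDPOINT always wins.
    """
    env = os.environ if env is None else env
    explicit = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if explicit:
        return explicit
    node_ip = env.get("NODE_IP")
    if not node_ip:
        # Fail loudly: a silent fallback to localhost hides the misconfiguration.
        raise RuntimeError("NODE_IP is not set; check the Downward API env entry")
    # IPv6 literals must be bracketed inside a URL.
    host = f"[{node_ip}]" if ":" in node_ip else node_ip
    return f"http://{host}:{port}"
```

Raising on a missing NODE_IP is a deliberate choice in this sketch: a service that quietly exports to localhost produces no errors and no traces, which is harder to notice than a failed startup.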
Gateway tier (StatefulSet). Two or more replicas. The gateway is where tail-based sampling lives — it needs to see all spans for a trace before making a keep/drop decision, which requires stable addressing (more on this in the sampling section). The gateway also handles attribute redaction and fan-out to multiple backends.
Why not skip the agent and send directly to the gateway? You can. But you lose node-level Kubernetes metadata enrichment, you push retry and batching logic into the application, and every service needs the gateway's address hardcoded. The agent is a node-local or cluster-local collection point, depending on how you expose it: applications send to a fixed local address, and the Collector handles everything else.
Deployment model options. The right model depends on your cluster size and requirements.
| MODEL | USE WHEN | TRADE-OFF |
|---|---|---|
| DaemonSet agent + gateway | Production default for multi-service clusters | More moving parts; correct and scalable |
| Sidecar per pod | True localhost endpoint required; strict isolation | High resource overhead at scale |
| Gateway only | Small cluster, pilot, or early evaluation | No node-level enrichment; single point of congestion |
| Direct to backend | Local development only | No sampling, no enrichment, no control |
For the agent to be reachable at the node IP, expose it with hostPort or hostNetwork on the DaemonSet, and restrict access with NetworkPolicy. If you expose the agent through a Kubernetes Service instead, do not assume node-local routing unless your Service configuration explicitly enforces it.
Application instrumentation and Collector configuration that survive production
Instrumentation minimum
Before configuring the Collector, the services need to send data. The practical minimum:
- Deploy the Collector via the OTel Collector Helm chart or the OpenTelemetry Operator. Both support DaemonSet and Deployment modes and handle upgrades without manual manifest management.
- Set OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES (at minimum deployment.environment) in pod environment variables; do not hardcode them in application code.
- Use auto-instrumentation for HTTP, gRPC, and database calls. The OTel Operator supports zero-code injection for Python, Java, and Node.js via pod annotations, which means no SDK changes for existing services.
- Add manual spans only for business-critical operations with no HTTP or DB analog: payment steps, batch processing stages, cache decision points.
- Set propagators explicitly and consistently across all languages. W3C TraceContext (traceparent) is the standard; commit to it before the first cross-service call.
Agent pipeline: OpenTelemetry Kubernetes example
```yaml
# collector-agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.namespace.name
        - k8s.node.name
      labels:
        - tag_name: app.version
          key: app.kubernetes.io/version
          from: pod
        - tag_name: team
          key: team
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection
  filter/drop_health_checks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'
        - 'attributes["http.route"] == "/metrics"'
  batch:
    send_batch_size: 512
    timeout: 5s

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-gateway-collector-headless
        ports: [4317]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, filter/drop_health_checks, batch]
      exporters: [loadbalancing]
```

The examples use tls.insecure: true for readability inside a trusted cluster. In regulated or multi-tenant environments, use TLS/mTLS between agents, gateways, and backends, and restrict OTLP endpoints with NetworkPolicy.
Processor order is not optional. memory_limiter must be first — it is the circuit breaker that prevents OOM under load. k8sattributes runs before filter so that filtering can reference Kubernetes attributes. batch runs last, immediately before export, so it accumulates enriched and filtered spans.
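Because a wrong processor order fails silently, it can be worth checking in CI. The following is a small sketch of such a check, encoding only the three ordering rules stated above; the function name check_processor_order is hypothetical, and it takes the processors list as it appears under a traces pipeline.

```python
def check_processor_order(processors):
    """Return a list of ordering problems for a traces pipeline's processor list."""
    problems = []
    if not processors or processors[0] != "memory_limiter":
        problems.append("memory_limiter must be the first processor")
    batch_like = [p for p in processors if p == "batch" or p.startswith("batch/")]
    if batch_like and processors[-1] not in batch_like:
        problems.append("batch must be the last processor")

    def first_index(prefix):
        # Match both the bare type ("filter") and named instances ("filter/x").
        for i, name in enumerate(processors):
            if name == prefix or name.startswith(prefix + "/"):
                return i
        return None

    k8s, flt = first_index("k8sattributes"), first_index("filter")
    if k8s is not None and flt is not None and k8s > flt:
        problems.append("k8sattributes must run before filter")
    return problems
```

Run against the agent pipeline shown earlier, the check returns an empty list; reversing that list yields one problem per rule.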
Health check filtering belongs in the agent, not the gateway. Filtering /healthz and /readyz here prevents those spans from entering the tail sampling buffer in the gateway — wasted memory if they are never kept.
loadbalancing exporter, not plain otlp. Tail sampling requires that all spans in a trace reach the same gateway replica. The loadbalancing exporter handles this by routing spans by trace ID — see the tail sampling section for the full explanation and StatefulSet requirement.
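To see why routing by trace ID matters, the toy comparison below sends four spans of one trace to two gateways, first by hashing the trace ID and then by taking turns. It is a simplification for illustration: it uses plain hash-modulo, whereas the exporter itself uses a consistent-hash ring, and the pod names are hypothetical.

```python
import hashlib
from itertools import cycle


def route_by_trace_id(trace_id, backends):
    """Pick a backend from the trace ID so every span of a trace lands together.

    Simplification: plain hash-modulo. A consistent-hash ring moves fewer
    traces when the backend list changes.
    """
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def route_round_robin(spans, backends):
    """Trace-unaware balancing: each span goes to the next backend in turn."""
    ring = cycle(backends)
    return [(span, next(ring)) for span in spans]


gateways = ["otel-gateway-0", "otel-gateway-1"]  # hypothetical pod names
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [f"span-{i}" for i in range(4)]  # four spans of the same trace

by_trace = {route_by_trace_id(trace_id, gateways) for _ in spans}
by_turn = {backend for _, backend in route_round_robin(spans, gateways)}
print(len(by_trace), len(by_turn))  # prints: 1 2
```

With trace-ID routing, one gateway sees the whole trace. With turn-taking, both gateways see half of it, and neither can make a correct tail sampling decision.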
Gateway pipeline configuration
```yaml
# collector-gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-debug-flagged
        type: string_attribute
        string_attribute:
          key: sampling.priority
          values: ["always_on"]
      - name: sample-normal
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  attributes/redact:
    actions:
      - key: db.statement
        action: update
        value: "[redacted]"
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete
  batch:
    send_batch_size: 1024
    timeout: 10s

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes/redact, batch]
      exporters: [otlp/tempo]
```

Expose the Collector's own telemetry. The health_check extension provides an HTTP health endpoint on port 13133 for Kubernetes readiness probes (served at / unless you set the extension's path option). The pprof extension is invaluable when debugging Collector memory or CPU spikes. Enable both; their overhead is negligible under normal operation.
Redaction at the gateway, not the agent. The agent enriches; the gateway sanitizes. This separation means you can add new redaction rules without touching per-node config.
Why traces break at service boundaries
Context propagation is the mechanism by which a trace follows a request across service calls. Service A creates a span, injects a traceparent header into the outbound request, and Service B extracts it and creates a child span under the same trace ID. When this breaks, you see two disconnected root spans instead of one trace. There is no error — just a silent gap.
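When debugging a broken boundary, it helps to compare the traceparent values seen on each side. The helper below is a minimal sketch of that comparison using only the standard library. It follows the version 00 layout of the W3C header (version, trace ID, parent ID, flags) and deliberately rejects anything longer; the function names are hypothetical.

```python
import re

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Split a W3C traceparent header into its fields, or return None if invalid."""
    match = _TRACEPARENT.match(value.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def same_trace(upstream_header, downstream_header):
    """True when both headers are valid and carry the same trace ID."""
    up = parse_traceparent(upstream_header)
    down = parse_traceparent(downstream_header)
    return bool(up and down and up["trace_id"] == down["trace_id"])
```

If same_trace returns False for the header Service A sent and the header Service B logged, something between the two services dropped or replaced the context.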
Three categories cover most production failures.
1. Propagation format mismatch
W3C TraceContext (traceparent) is the standard. B3 (Zipkin's format, used by older Java and Spring frameworks) and Jaeger's uber-trace-id are still common. A Python service using W3C calling a Java service using B3 produces two root spans with no connection.
Fix: set propagators explicitly at startup across all services, and choose one format before the first cross-service call.
```python
# Python -- set globally before any HTTP clients initialize
# B3MultiFormat ships in the opentelemetry-propagator-b3 package
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# During migration: support both; drop B3 once all services are aligned
set_global_textmap(CompositePropagator([
    TraceContextTextMapPropagator(),
    B3MultiFormat(),
]))
```

```go
// Go -- set once at startup, next to TracerProvider initialization.
// Imports: go.opentelemetry.io/otel, go.opentelemetry.io/otel/propagation,
// and go.opentelemetry.io/contrib/propagators/b3.
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
	propagation.TraceContext{},
	b3.New(),
))
```

2. Async jobs and background workers
HTTP auto-instrumentation injects and extracts context automatically. Celery tasks, Kafka consumers, cron jobs, and queue workers receive no context by default — they start new root spans because there is nothing to extract.
Fix: explicitly capture context when enqueueing and restore it when consuming.
```python
from opentelemetry import trace, propagate

# celery_app and do_processing are assumed to be defined elsewhere in the service.

# When enqueueing: capture current trace context into the task payload
def enqueue_order_processing(order_id: str):
    carrier = {}
    propagate.inject(carrier)  # e.g. {"traceparent": "00-4bf92f35...-00f067aa...-01"}
    process_order.delay(order_id, carrier)

# In the Celery task: restore context before doing any work
@celery_app.task
def process_order(order_id: str, carrier: dict):
    ctx = propagate.extract(carrier)
    with trace.get_tracer(__name__).start_as_current_span(
        "celery.process_order",
        context=ctx,
        kind=trace.SpanKind.CONSUMER,
    ) as span:
        span.set_attribute("order.id", order_id)
        do_processing(order_id)
```

The same pattern applies to Kafka (store the carrier in message headers), SQS (store in message attributes), and any other queue-based handoff.
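For the Kafka case, the only extra step is converting the carrier dict to and from the header format. The sketch below assumes headers are a list of (key, bytes) pairs, which is how common Python Kafka clients represent them; check your client before relying on it. Both function names are hypothetical.

```python
def carrier_to_kafka_headers(carrier):
    """Encode a propagation carrier as Kafka headers: a list of (str, bytes) pairs."""
    return [(key, value.encode("utf-8")) for key, value in carrier.items()]


def kafka_headers_to_carrier(headers):
    """Decode Kafka headers back into a dict, keeping only W3C propagation keys."""
    wanted = {"traceparent", "tracestate"}
    carrier = {}
    for key, value in headers or []:
        if key.lower() in wanted and value is not None:
            carrier[key.lower()] = value.decode("utf-8")
    return carrier
```

On the producer side, call propagate.inject(carrier) and pass carrier_to_kafka_headers(carrier) as the message headers. On the consumer side, pass kafka_headers_to_carrier(message_headers) to propagate.extract and start the consumer span with the resulting context, exactly as in the Celery task.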
3. Header loss at ingress and service mesh
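Ingress controllers and API gateways may strip unknown headers or overwrite traceparent with their own trace ID. Istio and Linkerd handle mTLS at the proxy layer and may emit their own proxy spans, but a sidecar cannot link an inbound request to the outbound calls it triggers; the application still has to copy the tracing headers from the request it received onto the requests it sends.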
Fix for Nginx Ingress:
```yaml
nginx.ingress.kubernetes.io/configuration-snippet: |
  proxy_set_header traceparent $http_traceparent;
  proxy_set_header tracestate $http_tracestate;
```

Fix for Istio: align the mesh tracing configuration with the propagation format used by your applications, and verify that proxies do not replace or drop traceparent / tracestate headers on the path between services.

For services that construct HTTP clients manually (not using auto-instrumented libraries), always inject context explicitly:
```python
from opentelemetry import propagate
import httpx

async def call_downstream(ctx, payload):
    headers = {}
    propagate.inject(headers)  # always; even if you think the library does it
    async with httpx.AsyncClient() as client:
        return await client.post("http://downstream/api", json=payload, headers=headers)
```

Propagation diagnosis checklist
When a trace looks short or a service appears as a root span unexpectedly:
```
[ ] Log incoming request headers in the downstream service
    → Is traceparent present? Does the trace-id segment match upstream?
[ ] Check propagator configuration in both services
    → Are they using the same format?
[ ] Check the ingress and any API gateway in between
    → Are traceparent and tracestate in the allow-list of forwarded headers?
[ ] If using a service mesh, check proxy tracing configuration
    → Is the mesh configured to forward (not replace) the application trace header?
[ ] For async jobs: is the carrier captured at enqueue time and restored at consume time?
[ ] Check if the Collector agent's filter is dropping spans from the service in question
```

OpenTelemetry tail-based sampling: keeping costs under control without losing signal
Head-based sampling, which decides at trace start whether to keep a trace, is simple but blind: it cannot know whether the trace will end in an error. A flat 10% head-based sample therefore keeps only about 10% of error traces. Working around that, for example by forcing an always-on sampler on known error-prone paths, requires SDK-level changes in every service. You keep noise and lose signal in equal proportion. Kubernetes cost optimization with OpenTelemetry starts here: tail-based sampling is the lever that lets you cut storage spend without cutting visibility into failures.
Tail-based sampling waits until the trace is complete, then decides. Keep all errors. Keep all traces above a latency threshold. Sample everything else at a low rate. This is the strategy that gives you the best chance of preserving the traces that matter — errors, slow requests, and explicitly flagged flows — provided routing, buffering, and decision_wait are configured correctly.
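A back-of-the-envelope estimate shows what this strategy does to storage. The sketch below is an approximation under stated assumptions: the 1% error share and 2% slow share are invented for illustration, the three conditions are treated as independent, and the per-trace figures (1,000 traces/sec, 20 spans, 1 KB per span) reuse the numbers from this guide's sizing example.

```python
def expected_keep_fraction(error_rate, slow_rate, sample_rate):
    """Fraction of traces kept when any of three policies can keep a trace.

    Assumes the three conditions are statistically independent, which real
    traffic rarely is (slow traces are often also errors); treat the result
    as an estimate, not a guarantee.
    """
    dropped = (1 - error_rate) * (1 - slow_rate) * (1 - sample_rate)
    return 1 - dropped


def daily_storage_gib(traces_per_sec, keep_fraction, spans_per_trace, span_bytes):
    """Rough stored volume per day, before backend compression."""
    kept_per_day = traces_per_sec * 86_400 * keep_fraction
    return kept_per_day * spans_per_trace * span_bytes / 2**30


# Hypothetical traffic mix: 1% errors, 2% slow, 5% probabilistic fallback.
keep = expected_keep_fraction(error_rate=0.01, slow_rate=0.02, sample_rate=0.05)
print(f"kept: {keep:.1%}")
print(f"storage: {daily_storage_gib(1000, keep, 20, 1024):.0f} GiB/day")
```

Under these assumptions a bit under 8% of traces are stored while every error and slow trace is retained. Comparing the result with the growth rate you observe in the backend is a quick way to tell whether the sampling config is behaving as intended.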
Tail sampling configuration
The full gateway pipeline configuration is in the previous section. Key parameters:
```yaml
tail_sampling:
  decision_wait: 10s      # How long to buffer spans before deciding
                          # Must exceed your slowest async operation's duration
  num_traces: 50000       # Max traces buffered simultaneously -- set this carefully
  policies:
    - name: keep-errors   # 1. Always keep error traces
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: keep-slow     # 2. Always keep traces above the latency threshold
      type: latency
      latency:
        threshold_ms: 2000
    - name: sample-normal # 3. Sample the rest
      type: probabilistic
      probabilistic:
        sampling_percentage: 5
```

The processor evaluates all policies independently. A trace is kept if at least one policy votes to sample it and no policy votes to drop it, so keep-errors will always win over the probabilistic fallback.
decision_wait must exceed your longest async path. If a Celery task takes up to thirty seconds, set decision_wait: 35s. A trace that has not fully arrived when the timeout fires gets a decision made on partial information — usually a drop, because partial traces have no latency signal and may not have an error yet.
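The "any policy can keep a trace" rule is easy to model. The sketch below is an illustrative stand-in, not the processor's implementation: the Trace class and function names are hypothetical, the latency threshold is assumed to be inclusive, and the probabilistic policy is approximated by hashing the trace ID so that results are repeatable.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: int


def keep_errors(trace):
    return trace.has_error


def keep_slow(trace, threshold_ms=2000):
    # Assumption: a trace exactly at the threshold counts as slow.
    return trace.duration_ms >= threshold_ms


def sample_normal(trace, percentage=5):
    # Deterministic stand-in for the probabilistic policy: hash the trace ID
    # into 10,000 buckets so the same trace always gets the same answer.
    digest = hashlib.sha256(trace.trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100


POLICIES = [keep_errors, keep_slow, sample_normal]


def decide(trace):
    """Keep the trace if any policy votes to sample it (no drop policies here)."""
    votes = {policy.__name__: policy(trace) for policy in POLICIES}
    return any(votes.values()), votes
```

The model also shows why a short decision_wait hurts: if the decision runs before the failing span arrives, has_error is still False and duration_ms is still small, so the trace falls through to the 5% fallback.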
The multi-gateway problem
Tail sampling requires all spans for a trace to reach the same Collector instance. Behind an ordinary Service, a multi-replica gateway receives traffic spread across replicas with no regard for trace ID, so different spans of the same trace land on different replicas, each of which sees a partial trace and makes a wrong sampling decision.
Fix: route by trace ID using the loadbalancing exporter in the agent (as shown in the agent config above). The agent hashes the trace ID and consistently routes to the same gateway pod. This requires the gateway to be a StatefulSet with a headless Service so individual pod addresses are DNS-resolvable.
```yaml
# Gateway as StatefulSet -- required for loadbalancing exporter to resolve pod addresses
apiVersion: opentelemetry.io/v1beta1  # verify the version supported by your installed Operator
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
spec:
  mode: statefulset
  replicas: 2
  # The Operator may or may not create a headless Service automatically depending
  # on the installed version. Check with:
  #   kubectl get svc -l app.kubernetes.io/component=opentelemetry-collector
  # If no headless Service exists (ClusterIP: None), define one explicitly.
```

Memory sizing rule
The num_traces limit determines peak memory. Rough calculation:
```
num_traces × avg_spans_per_trace × avg_span_size_bytes = peak memory

Example:
  50,000 × 20 spans × 1 KB = ~1 GB

Set memory_limiter to 80% of pod memory limit.
Set num_traces to fit within 70% of that.
Add 2× headroom for traffic spikes.
```

At 1,000 traces/sec with a 10s decision_wait, you need at least 10,000 traces in the buffer at any moment. 50,000 gives 5× headroom for bursts. If your burst factor is higher, increase num_traces first, then scale pod memory.
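The sizing rule can be turned into a small calculator. This is a sketch of the arithmetic above, nothing more: the function name is hypothetical, and the 2,560 MiB pod limit is an assumed value chosen so that 80% of it matches the 2,048 MiB memory_limiter setting in the gateway config.

```python
def tail_buffer_plan(traces_per_sec, decision_wait_s, burst_factor,
                     spans_per_trace, span_bytes, pod_memory_mib):
    """Apply the sizing rule: num_traces from rate x wait x burst, then check memory."""
    steady_traces = traces_per_sec * decision_wait_s
    num_traces = steady_traces * burst_factor
    peak_mib = num_traces * spans_per_trace * span_bytes / 2**20
    limiter_mib = pod_memory_mib * 0.80   # memory_limiter at 80% of the pod limit
    budget_mib = limiter_mib * 0.70       # buffer should fit in 70% of that
    return {
        "steady_traces": steady_traces,
        "num_traces": num_traces,
        "peak_mib": round(peak_mib),
        "limit_mib": round(limiter_mib),
        "fits": peak_mib <= budget_mib,
    }


# The guide's example: 1,000 traces/sec, 10s wait, 5x burst, 20 spans of 1 KB.
plan = tail_buffer_plan(1000, 10, 5, 20, 1024, pod_memory_mib=2560)
print(plan)
```

Rerunning the same plan with decision_wait_s=35 reports fits as False: a longer wait multiplies the buffer, so raising decision_wait for slow async paths has to be paired with more pod memory or a lower burst allowance.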
Operational checklist before rollout
This is not a validation checklist for the technology; it is a readiness checklist for the team running it. The specific failure modes behind each item are covered in the Limitations section below.
Collector self-observability
```
[ ] health_check extension enabled on agent and gateway
[ ] Kubernetes readiness probe points to the health_check endpoint (port 13133)
[ ] Collector's own metrics scraped by Prometheus
[ ] Alert on Collector pod memory approaching its limit
    → memory_limiter starts refusing data before the pod is OOMKilled; alert before it gets there
[ ] Alert on exporter send failures and queue saturation
    → otelcol_exporter_send_failed_spans is the usual signal; some Collector versions expose it with a _total suffix
[ ] Alert on refused or dropped spans at the receiver
[ ] Alert on tail-sampling processor eviction, late-span, and decision metrics
    → metric names vary by Collector version; check your installed version's telemetry docs
    → https://opentelemetry.io/docs/collector/internal-telemetry/
```

Sampling validation
```
[ ] Generate a test trace that results in an ERROR status
    → Verify it appears in the backend despite low probabilistic sampling rate
[ ] Generate a test trace above the latency threshold
    → Verify it appears
[ ] Run for 24h and check backend storage growth rate
    → Does it match the expected rate given your sampling config?
```

Propagation tests
```
[ ] Trigger a request that crosses at least two service boundaries
    → Verify a single trace ID appears in all three services' spans
[ ] Trigger an async job enqueued from an HTTP handler
    → Verify the job's spans appear as children of the HTTP span
[ ] Check ingress logs: is traceparent forwarded to the first backend service?
```

Backend and cost limits
```
[ ] Backend retention policy set and tested (data volume × retention = storage cost)
[ ] Backend ingestion rate limits configured to prevent runaway cost
[ ] Collector gateway resource limits set with memory_limiter aligned to pod limits
```

Runbook ownership
```
[ ] Who is on call for Collector issues?
[ ] Is the Collector restart procedure documented?
[ ] What happens to traces in flight during a gateway rolling restart?
    (Answer: spans in the tail sampling buffer are lost -- document this)
[ ] Is there a procedure for temporarily raising sampling rates during an incident?
```

Limitations and trade-offs
Before committing to this architecture, the team should be clear on what it costs and what it cannot do.
Tail sampling memory is your primary operational risk. When a traffic spike pushes more simultaneous open traces than num_traces allows, the oldest traces are evicted before a sampling decision is made — and you lose data with no alerting unless you instrument the Collector's own metrics. This is not a warning you can configure away; it is a consequence of stateful buffering at scale.
Late spans corrupt tail sampling decisions. If a span arrives after the decision_wait timeout, it is processed as a new root trace. Its parent trace may have already been dropped. These orphaned late spans are visible in the backend as short single-span traces with no apparent parent — a diagnostic signal that your decision_wait is too short or a service has unusually high processing latency.
Context propagation is not automatic everywhere. Auto-instrumentation handles HTTP and gRPC. Everything else — message queues, cron jobs, batch processing, custom protocols — requires explicit propagation code. This is not a limitation you can configure around; it requires application changes in each service that runs async work.
Kubernetes metadata enrichment has a startup race. Spans emitted in the first few seconds of pod startup may arrive at the agent before the Kubernetes API has propagated the pod metadata. These spans are enriched partially or not at all. This is expected behavior; do not alert on missing k8s.pod.name without filtering out spans from pods younger than thirty seconds.
What to take from this
The architecture in this guide is not the simplest possible OTel setup. It is the minimum that works reliably at production scale in a multi-service Kubernetes cluster. The three decisions that matter most, and that hello-world examples do not force you to make:
Where context propagation breaks. It will break at the first async boundary, the first ingress that strips headers, and the first service written in a different language with a different propagator configured. Finding those breaks before your users do requires explicit tests — not assuming auto-instrumentation handled everything.
How much buffer tail sampling needs. The num_traces limit is not a suggestion. When it is exceeded, the oldest traces are evicted silently. Size it for your burst traffic, not your average, and alert on memory pressure before the buffer fills.
That the Collector is now your infrastructure. Once services depend on it for telemetry, it needs the same operational treatment as your ingress or DNS: resource limits, readiness probes, a runbook, and someone on call who knows how to restart it and what a restart does to in-flight traces.
The gap between "OTel installed" and "OTel working" is mostly these three things. The YAML gets you there. The operational discipline keeps you there.
