Designing a Distributed Tracing Sampling Strategy

Master trace sampling approaches — head-based, tail-based, adaptive, and priority sampling — to reduce overhead while preserving signal in distributed systems.

srikanthtelkalapally888@gmail.com

Distributed tracing generates enormous data volumes. Sampling strategies reduce overhead while preserving visibility into errors and slow requests.

The Sampling Problem

At 100K req/sec × 100 spans per trace:
  = 10M spans/sec
  At 1KB per span = 10GB/sec
  → Impractical to store or process everything

Solution: Sample N% of traces
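
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `span_volume` is made up for illustration, and 1 KB is taken as 1,000 bytes to match the 10 GB/sec figure.

```python
def span_volume(req_per_sec, spans_per_trace, bytes_per_span, sample_rate=1.0):
    """Return (spans per second, bytes per second) retained at a sample rate."""
    spans = req_per_sec * spans_per_trace * sample_rate
    return spans, spans * bytes_per_span


# Figures from the text: 100K req/sec, 100 spans per trace, 1 KB per span.
full_spans, full_bytes = span_volume(100_000, 100, 1_000)

# The same traffic with 1% of traces sampled (an example rate, not a recommendation).
kept_spans, kept_bytes = span_volume(100_000, 100, 1_000, sample_rate=0.01)

print(f"unsampled: {full_spans:,.0f} spans/sec, {full_bytes / 1e9:.1f} GB/sec")
print(f"1% sample: {kept_spans:,.0f} spans/sec, {kept_bytes / 1e9:.1f} GB/sec")
```

Because sampling is applied per trace, the reduction in spans and bytes is proportional to the sample rate.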

Head-Based Sampling

Decision made at trace entry point:

Gateway receives request:
  Random number < sample_rate?
    Yes → Record full trace
    No  → Discard

Pros: Simple, low overhead
Cons: The decision is made before the outcome is known, so errors and slow requests are dropped at the same rate as everything else

Problem: A rare but critical error trace can be sampled away.
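
A minimal sketch of the head-based decision, assuming a single gateway makes the call once per request. The names `head_sample` and `rng` are hypothetical, and the injectable random source exists only to make the example repeatable.

```python
import random


def head_sample(sample_rate, rng=random):
    """Decide at the trace entry point whether to record the whole trace.

    random() returns a value in [0, 1), so a rate of 1.0 keeps every
    trace and a rate of 0.0 keeps none.
    """
    return rng.random() < sample_rate


# Simulate 10,000 incoming requests at a 1% sample rate.
rng = random.Random(42)  # seeded only so repeated runs agree
kept = sum(head_sample(0.01, rng) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Nothing in `head_sample` looks at what happens later in the request, which is why an error deep in the call chain has the same chance of being dropped as any normal request.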

Tail-Based Sampling

Decision made AFTER trace completes:

Collect ALL spans in buffer
    ↓
Wait for trace to complete (30s window)
    ↓
Evaluate completed trace:
  Error? → KEEP
  Slow (>1s)? → KEEP
  Random 1%? → KEEP
  Otherwise → DISCARD
    ↓
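Write kept traces to storage

Pros: Errors and slow requests are always kept, as long as the trace completes within the buffer window
Cons: Every span must be buffered until the decision is made (memory intensive)
Tools: Jaeger, OpenTelemetry Collector (tail sampling processor)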
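
The evaluation step can be sketched as follows. This is not how Jaeger or the OpenTelemetry Collector implement it. `Span`, `TailSampler`, and `keep_trace` are hypothetical names, the 30s wait is left to the caller, and the longest span is assumed to stand in for the trace's total latency.

```python
import random
from collections import defaultdict
from dataclasses import dataclass

SLOW_MS = 1000        # "Slow (>1s)" rule from the text
BASELINE_RATE = 0.01  # "Random 1%" rule from the text


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, rng=random):
    """Apply the keep rules to a completed trace, in the order listed above."""
    if any(s.is_error for s in spans):
        return True
    # Assumption: the longest span approximates end-to-end latency.
    if max((s.duration_ms for s in spans), default=0.0) > SLOW_MS:
        return True
    return rng.random() < BASELINE_RATE


class TailSampler:
    """Buffers spans per trace until the caller says the trace is complete."""

    def __init__(self, rng=random):
        self.buffer = defaultdict(list)
        self.rng = rng

    def add(self, span):
        self.buffer[span.trace_id].append(span)

    def complete(self, trace_id):
        """Call when the wait window expires; returns the spans to store."""
        spans = self.buffer.pop(trace_id, [])
        return spans if keep_trace(spans, self.rng) else []


sampler = TailSampler()
sampler.add(Span("t1", 40))
sampler.add(Span("t1", 15, is_error=True))
print(len(sampler.complete("t1")), "spans written for t1")
```

The `buffer` dictionary is where the memory cost shows up: every span of every in-flight trace sits there until `complete` is called.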

Priority Sampling

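Assign a priority to each trace and sample each tier at its own rate. Rules based on errors or latency can only be evaluated once the trace is complete, so they fit naturally into a tail-based sampler: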

Priority 1 (always keep):
  Any error span
  Latency > p99 threshold
  User flagged as VIP
  Manual debug trace

Priority 2 (sample 10%):
  Latency > p95

Priority 3 (sample 1%):
  Normal requests

Priority 0 (drop):
  Health checks, synthetic monitors
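
The tiers above map naturally onto a lookup table. The sketch below uses hypothetical names (`TraceSummary`, `priority`, `should_keep`) and makes one assumption the text leaves open: priority 1 rules are checked first, so a failing or manually debugged health check is still kept.

```python
import random
from dataclasses import dataclass

# Sample rate per priority tier, taken from the list above.
SAMPLE_RATE = {0: 0.0, 1: 1.0, 2: 0.10, 3: 0.01}


@dataclass
class TraceSummary:
    has_error: bool = False
    latency_ms: float = 0.0
    vip_user: bool = False
    debug: bool = False
    health_check: bool = False  # also covers synthetic monitors


def priority(t, p95_ms, p99_ms):
    """Return the priority tier (0-3) for a completed trace."""
    # Assumption: "always keep" rules win over the "drop" rule.
    if t.has_error or t.latency_ms > p99_ms or t.vip_user or t.debug:
        return 1
    if t.health_check:
        return 0
    if t.latency_ms > p95_ms:
        return 2
    return 3


def should_keep(t, p95_ms, p99_ms, rng=random):
    """Sample the trace at the rate assigned to its priority tier."""
    return rng.random() < SAMPLE_RATE[priority(t, p95_ms, p99_ms)]


# Example thresholds (illustrative values, not from the text).
print(priority(TraceSummary(latency_ms=500), p95_ms=200, p99_ms=800))
```

The p95 and p99 thresholds are passed in rather than hard-coded because they would normally be computed per service from recent latency data.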

Adaptive Sampling

Dynamically adjust rate based on volume:

Target: 100 traces/sec per service

If service sees 1000 req/sec: sample 10%
If service sees 10K req/sec:  sample 1%
If service sees 100 req/sec:  sample 100%

Keeps trace volume near the target regardless of traffic: sample rate = min(100%, target / observed req/sec)
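
The three cases above follow from a single formula. A minimal sketch, assuming the observed request rate is measured elsewhere (for example over a sliding window) and passed in; `adaptive_rate` is a hypothetical name.

```python
def adaptive_rate(observed_rps, target_tps=100):
    """Return the sample rate that yields about target_tps traces per second."""
    if observed_rps <= 0:
        return 1.0  # no traffic observed yet: keep whatever arrives
    return min(1.0, target_tps / observed_rps)


# The three cases from the text.
for rps in (1_000, 10_000, 100):
    print(f"{rps:>6} req/sec -> sample {adaptive_rate(rps):.0%}")
```

Capping the rate at 1.0 covers services that receive less traffic than the target, where every trace is kept.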

Propagation

Sampling decision propagated in trace context:
  traceparent: 00-traceId-spanId-01 (sampled)
  traceparent: 00-traceId-spanId-00 (not sampled)

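With head-based sampling, all downstream services respect the upstream decision
→ Either all spans are kept or all are discarded (no partial traces)

With tail-based sampling, services record every span and the collector makes the keep/discard decision later.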
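
A sketch of how a downstream service might read and forward the flag, assuming the version `00` layout of the W3C `traceparent` header (version, trace ID, parent span ID, flags). The function names are hypothetical, and the header value is an example.

```python
import secrets


def is_sampled(traceparent):
    """Return True if the sampled bit (lowest bit of the flags byte) is set."""
    parts = traceparent.split("-")
    if len(parts) != 4:
        raise ValueError(f"unexpected traceparent format: {traceparent!r}")
    _version, _trace_id, _parent_id, flags = parts
    return bool(int(flags, 16) & 0x01)


def child_traceparent(traceparent):
    """Build the header for an outgoing call: new span ID, same trace ID and flags."""
    version, trace_id, _parent_id, flags = traceparent.split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(is_sampled(incoming), is_sampled(outgoing))
```

Copying the flags unchanged into the outgoing header is what keeps every service on the call path in agreement about the decision.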

Conclusion

Tail-based sampling with priority rules gives the best coverage: errors and slow traces are kept, and everything else is sampled at a low baseline rate. The cost is buffering every span until the trace completes.