Designing a Distributed Tracing Sampling Strategy
Master trace sampling approaches — head-based, tail-based, adaptive, and priority sampling — to reduce overhead while preserving signal in distributed systems.
srikanthtelkalapally888@gmail.com
Distributed tracing generates enormous data volumes. Sampling strategies reduce overhead while preserving visibility into errors and slow requests.
The Sampling Problem
100K req/sec × 100 spans per trace = 10M spans/sec
10M spans/sec × 1 KB per span = 10 GB/sec
→ Impractical to store or process in full
Solution: Sample N% of traces
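The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name span_volume is made up for illustration, and 1 KB is taken as 1,000 bytes.

```python
def span_volume(req_per_sec: float, spans_per_trace: int,
                bytes_per_span: int, sample_rate: float = 1.0):
    """Return (spans/sec, bytes/sec) retained at a given trace sample rate."""
    spans_per_sec = req_per_sec * spans_per_trace * sample_rate
    return spans_per_sec, spans_per_sec * bytes_per_span


# Unsampled: the figures from the text (100K req/sec, 100 spans, 1 KB each).
full_spans, full_bytes = span_volume(100_000, 100, 1_000)
print(f"unsampled: {full_spans:,.0f} spans/sec, {full_bytes / 1e9:.1f} GB/sec")

# Sampling 1% of traces cuts both numbers by a factor of 100.
kept_spans, kept_bytes = span_volume(100_000, 100, 1_000, sample_rate=0.01)
print(f"1% sample: {kept_spans:,.0f} spans/sec, {kept_bytes / 1e6:.1f} MB/sec")
```

Because sampling is applied per trace rather than per span, the reduction is linear in the sample rate.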
Head-Based Sampling
Decision made at trace entry point:
Gateway receives request:
  Random number < sample_rate? → Record full trace
  Otherwise → Discard
Pros: Simple, low overhead
Cons: The decision is made before the outcome is known, so errors in unsampled traces are invisible
Problem: A rare but critical error trace can be sampled away!
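A minimal sketch of the head-based decision, assuming a purely random choice at the entry point as described above. The function name head_sample and the injectable rng parameter are illustrative, not taken from any tracing library.

```python
import random
from typing import Optional


def head_sample(sample_rate: float, rng: Optional[random.Random] = None) -> bool:
    """Decide at the trace entry point whether to record the whole trace.

    sample_rate is a fraction in [0.0, 1.0]; rng is injectable so the
    decision can be made reproducible in tests.
    """
    rng = rng if rng is not None else random.Random()
    # random() returns a float in [0.0, 1.0), so a rate of 0.0 never samples
    # and a rate of 1.0 always samples.
    return rng.random() < sample_rate


# Simulate 10,000 requests at a 10% sample rate.
rng = random.Random(42)
recorded = sum(head_sample(0.10, rng) for _ in range(10_000))
print(f"recorded {recorded} of 10000 traces")
```

Note that nothing about the request's outcome is available when head_sample runs, which is exactly why an error trace can be discarded.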
Tail-Based Sampling
Decision made AFTER trace completes:
Collect ALL spans in buffer
↓
Wait for trace to complete (30s window)
↓
Evaluate completed trace:
Error? → KEEP
Slow (>1s)? → KEEP
Random 1%? → KEEP
Otherwise → DISCARD
↓
Write kept traces to storage
Pros: Errors and slow requests are always kept, provided the trace completes within the buffer window
Cons: Every span must be buffered until the decision is made (memory intensive)
Tools: Jaeger, OpenTelemetry Collector
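The evaluation step can be sketched as a plain function over a buffered trace. This is an illustrative sketch, not how Jaeger or the OpenTelemetry Collector implement it: the Span fields, keep_trace, and flush are hypothetical names, and a trace is treated as slow when its longest span exceeds the threshold (assuming the root span covers the whole request).

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans: List[Span], slow_ms: float = 1000.0,
               baseline_rate: float = 0.01, rng=random) -> bool:
    """Apply the rules in order: error, slow (>1s), then a random 1%."""
    if any(s.is_error for s in spans):
        return True
    if max((s.duration_ms for s in spans), default=0.0) > slow_ms:
        return True
    return rng.random() < baseline_rate


def flush(buffered: Iterable[Span], slow_ms: float = 1000.0,
          baseline_rate: float = 0.01) -> Dict[str, List[Span]]:
    """Group buffered spans by trace and return only the traces to store."""
    traces: Dict[str, List[Span]] = defaultdict(list)
    for span in buffered:
        traces[span.trace_id].append(span)
    return {tid: spans for tid, spans in traces.items()
            if keep_trace(spans, slow_ms, baseline_rate)}


buffer = [
    Span("a", 20.0, is_error=True), Span("a", 5.0),  # error trace
    Span("b", 1500.0),                               # slow trace
    Span("c", 30.0),                                 # ordinary trace
]
# baseline_rate=0.0 disables the random rule so the result is deterministic.
kept = flush(buffer, baseline_rate=0.0)
print(sorted(kept))
```

A real collector would also need to expire traces whose window has elapsed and bound the buffer size, which this sketch leaves out.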
Priority Sampling
Assign priority to each trace:
Priority 1 (always keep):
  Any error span
  Latency > p99 threshold
  User flagged as VIP
  Manual debug trace
Priority 2 (sample 10%):
  Latency > p95
Priority 3 (sample 1%):
  Normal requests
Priority 0 (drop):
  Health checks, synthetic monitors
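The priority tiers above map naturally onto a classifier plus a rate table. The sketch below is one possible reading: TraceSummary and its fields are hypothetical, the p95/p99 thresholds are passed in by the caller, and health checks are assumed to be dropped before any other rule is checked (the text does not say which rule wins when a health check also has an error).

```python
import random
from dataclasses import dataclass

# Sample rate per priority tier, as listed in the text.
SAMPLE_RATES = {0: 0.0, 1: 1.0, 2: 0.10, 3: 0.01}


@dataclass
class TraceSummary:
    latency_ms: float
    has_error: bool = False
    is_vip: bool = False
    debug: bool = False
    is_health_check: bool = False


def priority(t: TraceSummary, p95_ms: float, p99_ms: float) -> int:
    """Classify a trace into one of the four priority tiers."""
    if t.is_health_check:  # assumption: drop rule is evaluated first
        return 0
    if t.has_error or t.is_vip or t.debug or t.latency_ms > p99_ms:
        return 1
    if t.latency_ms > p95_ms:
        return 2
    return 3


def should_keep(t: TraceSummary, p95_ms: float, p99_ms: float,
                rng=random) -> bool:
    """Sample the trace at the rate assigned to its priority tier."""
    return rng.random() < SAMPLE_RATES[priority(t, p95_ms, p99_ms)]


# Example with assumed thresholds p95 = 200 ms and p99 = 500 ms.
print(priority(TraceSummary(latency_ms=300), 200, 500))
```

Keeping the rates in a table separate from the classification makes it easy to tune the percentages without touching the rules.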
Adaptive Sampling
Dynamically adjust rate based on volume:
Target: 100 traces/sec per service
If service sees 1000 req/sec: sample 10%
If service sees 10K req/sec: sample 1%
If service sees 100 req/sec: sample 100%
In each case the rate is min(100%, target ÷ observed request rate), which keeps trace volume roughly constant regardless of traffic
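The three examples above follow from a single formula, sketched here. The function name adaptive_rate is illustrative, and the sketch assumes the observed request rate has already been measured over some recent window.

```python
def adaptive_rate(observed_rps: float, target_tps: float = 100.0) -> float:
    """Return the sample rate (0.0 to 1.0) that yields about target_tps traces/sec."""
    if observed_rps <= 0:
        # No traffic observed: keep everything that does arrive.
        return 1.0
    # Cap at 1.0: a service below the target cannot keep more than 100%.
    return min(1.0, target_tps / observed_rps)


for rps in (100, 1_000, 10_000):
    print(f"{rps} req/sec -> sample {adaptive_rate(rps):.0%}")
```

In practice the observed rate would be smoothed (for example with a moving average) so the sample rate does not oscillate on short traffic spikes.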
Propagation
Sampling decision propagated in trace context:
traceparent: 00-traceId-spanId-01 (sampled)
traceparent: 00-traceId-spanId-00 (not sampled)
All downstream services honor the upstream decision
→ Each trace is either kept in full or discarded in full (no partial traces)
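A downstream service can read the decision from the last field of the header. The sketch below assumes the W3C Trace Context layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, with the lowest flag bit meaning "sampled"); the function name is_sampled is illustrative.

```python
def is_sampled(traceparent: str) -> bool:
    """Return True if the traceparent header carries the sampled flag."""
    parts = traceparent.strip().split("-")
    if len(parts) < 4:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    version, trace_id, parent_id, flags = parts[:4]
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    # The flags field is a hex byte; bit 0 is the sampled flag.
    return bool(int(flags, 16) & 0x01)


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(is_sampled(header))
```

Testing the bit rather than comparing the string to "01" keeps the check correct if other flag bits are set.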
Conclusion
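Tail-based sampling combined with priority rules gives the best coverage: errors and slow traces are kept, and everything else is sampled proportionally. The price is the memory needed to buffer spans until each trace completes, so head-based or adaptive sampling remains the cheaper choice when that cost is too high.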