Designing a Time-Series Database
Design a purpose-built time-series database: LSM storage, time-based partitioning, downsampling, retention policies, and compression for metrics.
srikanthtelkalapally888@gmail.com
Time-series databases are purpose-built for timestamped data — metrics, sensor readings, financial ticks — with unique access patterns.
Why Not a General-Purpose Database?
Time-series characteristics:
Writes: Almost always at or near the current time (append-heavy)
Reads: Time-range scans and aggregations
Deletes: By age (retention), not individual rows
Updates: Rare (historical data is effectively immutable)
Cardinality: Potentially millions of unique series
A general-purpose relational database tends to struggle with:
Continuous high-frequency writes
Time-range aggregation performance
Automatic data expiry
High-cardinality tag indexing
Data Model
Measurement: cpu_usage
Tags (indexed): { host: "server-01", region: "us-east" }
Fields (values): { usage_percent: 72.3, load_avg: 1.4 }
Timestamp: 1709568000000000000 (nanoseconds since the Unix epoch, i.e. 2024-03-04T16:00:00Z)
Series = unique combination of measurement + tags
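To make the series definition concrete, here is a minimal Python sketch of a canonical series key and a toy inverted index over tags. The key format and the names `series_key` and `TagIndex` are assumptions made for illustration, not the layout of any particular database.

```python
from collections import defaultdict

def series_key(measurement, tags):
    """Canonical series key: measurement plus tags sorted by tag name."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_part}" if tag_part else measurement

class TagIndex:
    """Toy inverted index: (tag key, tag value) -> set of series keys."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, measurement, tags):
        key = series_key(measurement, tags)
        for pair in tags.items():
            self._postings[pair].add(key)
        return key

    def lookup(self, **tag_filters):
        """Series that match every tag filter (intersection of posting sets)."""
        sets = [self._postings.get(pair, set()) for pair in tag_filters.items()]
        return set.intersection(*sets) if sets else set()

idx = TagIndex()
k1 = idx.add("cpu_usage", {"region": "us-east", "host": "server-01"})
k2 = idx.add("cpu_usage", {"host": "server-02", "region": "us-east"})
```

Sorting the tags means the same tag set always maps to the same series, whatever order the client sent it in. The index grows with the number of distinct tag values, which is why high cardinality is a central design concern.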
Storage Design
Time-ordered LSM variant:
WAL → MemTable (current time window)
↓ (flush when full)
TSM Files (Time-Structured Merge)
↓ (periodic compaction)
Compressed TSM Files
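The write path above can be sketched in a few lines of Python. This is a toy under stated assumptions: `TinyTSStore` is a hypothetical name, the WAL is any writable text stream, and flushed segments stay in memory rather than becoming compressed files on disk.

```python
import io
import json

class TinyTSStore:
    """Toy write path: append to WAL, buffer in a memtable, flush sorted segments."""

    def __init__(self, wal, max_points=4):
        self.wal = wal              # any writable text stream
        self.max_points = max_points
        self.memtable = {}          # series key -> list of (timestamp, value)
        self.count = 0
        self.segments = []          # immutable flushed segments (in memory here)

    def write(self, series, ts, value):
        # 1. Log the point first so it survives a crash.
        self.wal.write(json.dumps([series, ts, value]) + "\n")
        self.wal.flush()
        # 2. Buffer it in memory.
        self.memtable.setdefault(series, []).append((ts, value))
        self.count += 1
        # 3. Flush once the buffer is full.
        if self.count >= self.max_points:
            self.flush()

    def flush(self):
        """Turn the memtable into a segment sorted by series, then by time."""
        segment = {s: sorted(pts) for s, pts in sorted(self.memtable.items())}
        self.segments.append(segment)
        self.memtable, self.count = {}, 0
        # A real engine would persist the segment and then discard the WAL
        # entries it covers; this sketch leaves the WAL untouched.

    def compact(self):
        """Merge all segments into one, keeping each series time-ordered."""
        merged = {}
        for seg in self.segments:
            for s, pts in seg.items():
                merged.setdefault(s, []).extend(pts)
        self.segments = [{s: sorted(pts) for s, pts in sorted(merged.items())}]

store = TinyTSStore(io.StringIO(), max_points=2)
store.write("cpu_usage,host=server-01", 2, 72.3)
store.write("cpu_usage,host=server-01", 1, 71.9)  # buffer is full, so this flushes
```

Keeping each series contiguous and time-ordered inside a segment is what lets the compression schemes in the next section see runs of similar values.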
Compression
Timestamps: Delta-of-delta encoding (differences of differences)
[1000, 1005, 1010, 1015] → deltas [5, 5, 5] → stored as [1000, 5, 0, 0] (first value, first delta, then delta changes) → Very compressible
Floats: XOR encoding (Gorilla algorithm)
Each value is XORed with its predecessor; similar consecutive values share sign, exponent, and high mantissa bits, so the result is mostly zero bits
→ Excellent compression for slowly changing metrics
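Integers: Simple8b (pack multiple small ints into one 64-bit word)
Result: Often 10x or more versus raw 16-byte points for regular, slowly changing metrics (Gorilla reports about 1.37 bytes per point on average); ratios depend heavily on the data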
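The two core transforms can be illustrated with a short Python sketch. It shows only the arithmetic (delta-of-delta values and XOR residues); the variable-length bit packing that real encoders apply afterwards is omitted, and all function names are hypothetical.

```python
import struct

def delta_of_delta_encode(timestamps):
    """Return [first value, first delta, then changes between consecutive deltas]."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def delta_of_delta_decode(encoded):
    """Inverse of delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out

def float_bits(x):
    """IEEE 754 double as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """First value's bits, then each value XORed with its predecessor."""
    bits = [float_bits(v) for v in values]
    if not bits:
        return []
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

def meaningful_bits(xored):
    """(leading zero bits, count of bits between first and last set bit)."""
    if xored == 0:
        return (64, 0)
    leading = 64 - xored.bit_length()
    trailing = (xored & -xored).bit_length() - 1
    return (leading, 64 - leading - trailing)

encoded = delta_of_delta_encode([1000, 1005, 1010, 1015])   # [1000, 5, 0, 0]
residues = xor_stream([72.0, 72.0, 72.5])
```

A repeated value XORs to zero and can be stored as a single bit, while a small change such as 72.0 to 72.5 leaves only a narrow window of significant bits between long runs of zeros.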
Time-Based Partitioning (Shards)
Shard by time range:
Shard 1: 2026-01-01 to 2026-01-07
Shard 2: 2026-01-08 to 2026-01-14
...
Benefits:
Drop entire shard for retention (instant delete)
Query only relevant time shards
Hot shard: latest data (fast writes)
Cold shards: compressed, read-only
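A minimal Python sketch of the shard arithmetic, assuming fixed 7-day shards aligned to 2026-01-01 UTC to match the example above. Shard ids here are zero-based, so id 0 corresponds to "Shard 1"; the function names and the alignment origin are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

SHARD_WIDTH = timedelta(days=7)
EPOCH = datetime(2026, 1, 1, tzinfo=timezone.utc)  # assumed alignment origin

def shard_id(ts):
    """Zero-based id of the shard that contains timestamp ts."""
    return (ts - EPOCH) // SHARD_WIDTH

def shard_bounds(sid):
    """Half-open [start, end) time range covered by a shard."""
    start = EPOCH + sid * SHARD_WIDTH
    return start, start + SHARD_WIDTH

def shards_for_range(start, end):
    """Ids of the shards that overlap the half-open query range [start, end)."""
    last = shard_id(end - timedelta(microseconds=1))
    return list(range(shard_id(start), last + 1))

def expired_shards(shard_ids, now, retention):
    """Shards whose entire range is older than the retention cutoff."""
    cutoff = now - retention
    return [sid for sid in shard_ids if shard_bounds(sid)[1] <= cutoff]
```

Retention then reduces to deleting whole shards returned by `expired_shards`, and a query touches only the ids returned by `shards_for_range`.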
Downsampling
Raw data (1s resolution):
Too much for long-term storage
Continuous aggregation:
1s data → Keep 7 days
1m averages (from 1s) → Keep 90 days
1h averages (from 1m) → Keep 2 years
1d averages (from 1h) → Keep forever
Query engine: Ideally selects the appropriate resolution automatically, based on the query's time range and step
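The cascade above can be sketched in Python as follows. One design assumption is made explicit: each rollup stores a sum and a count rather than a finished average, because averaging averages gives a skewed result when buckets hold different numbers of samples. The tier table mirrors the retention periods listed above; the function names are hypothetical.

```python
from collections import defaultdict

def from_raw(samples):
    """Lift raw (ts_seconds, value) samples into (ts, sum, count) form."""
    return [(ts, value, 1) for ts, value in samples]

def rollup(points, bucket_seconds):
    """Aggregate (ts, sum, count) points into coarser, time-aligned buckets."""
    acc = defaultdict(lambda: [0.0, 0])
    for ts, total, count in points:
        bucket = ts - ts % bucket_seconds
        acc[bucket][0] += total
        acc[bucket][1] += count
    return [(b, s, c) for b, (s, c) in sorted(acc.items())]

def averages(rollups):
    """Derive the average of each bucket from its stored sum and count."""
    return [(b, s / c) for b, s, c in rollups]

# (resolution in seconds, retention in seconds; None means keep forever)
TIERS = [
    (1, 7 * 86400),
    (60, 90 * 86400),
    (3600, 2 * 365 * 86400),
    (86400, None),
]

def pick_resolution(oldest_age_seconds):
    """Finest tier that still retains data as old as the start of the query."""
    for resolution, retention in TIERS:
        if retention is None or oldest_age_seconds <= retention:
            return resolution

raw = [(0, 10.0), (30, 20.0), (60, 60.0)]
per_minute = rollup(from_raw(raw), 60)      # minute 0 holds 2 samples, minute 60 holds 1
per_hour = rollup(per_minute, 3600)
```

In this example the hourly average comes out as 30.0, the true mean of the three samples, whereas averaging the two minute averages (15.0 and 60.0) would give 37.5.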
Query Examples
-- InfluxQL
SELECT mean(usage_percent)
FROM cpu_usage
WHERE host='server-01'
AND time > now() - 24h
GROUP BY time(5m) FILL(previous)
-- FILL closes the GROUP BY clause and fills empty windows
-- Options: null (default), none, previous, linear, or a constant such as 0
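To show what a windowed mean with gap filling computes, here is a small Python sketch of the same idea. It is an approximation for illustration, not InfluxDB's implementation: `mean_by_window` is a hypothetical helper, timestamps are plain integers in seconds, and windows are aligned to multiples of the width.

```python
def mean_by_window(points, start, end, width, fill="previous"):
    """Mean per window over [start, end); empty windows are filled per `fill`.

    points: list of (ts, value) pairs
    fill:   "previous" repeats the last computed mean, anything else yields None
    """
    sums = {}
    for ts, value in points:
        if start <= ts < end:
            window = ts - ts % width
            s, c = sums.get(window, (0.0, 0))
            sums[window] = (s + value, c + 1)

    out, last = [], None
    first = start - start % width
    for window in range(first, end, width):
        if window in sums:
            s, c = sums[window]
            last = s / c
            out.append((window, last))
        elif fill == "previous":
            out.append((window, last))   # stays None until a value has been seen
        else:
            out.append((window, None))
    return out

points = [(0, 70.0), (100, 74.0), (650, 80.0)]
filled = mean_by_window(points, start=0, end=900, width=300)
```

Here the 300-second window starting at 300 contains no points, so it repeats the previous window's mean of 72.0.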
Conclusion
TSDBs achieve their performance through time-partitioned storage, delta/XOR compression, continuous downsampling, and time-range-aware indexing. InfluxDB, TimescaleDB, and Prometheus each apply these principles to varying degrees.