Designing a Time-Series Database

How to design a purpose-built time-series database: LSM-style storage, time-based partitioning, downsampling, retention policies, and compression for metrics.

srikanthtelkalapally888@gmail.com

Time-series databases are built for timestamped data such as metrics, sensor readings, and financial ticks, whose access patterns differ sharply from those of transactional workloads.

Why Not a General-Purpose Database?

Time-series characteristics:
  Writes: Almost always near the current time (append-heavy)
  Reads:  Time-range scans and aggregations
  Deletes: By age (retention), not by individual row
  Updates: Rare (historical data is effectively immutable)
  Cardinality: Often millions of unique series

A general-purpose relational database, without time-series extensions, struggles with:
  Continuous high-frequency writes
  Time-range aggregation performance
  Automatic data expiry
  High-cardinality tag indexing

Data Model

Measurement: cpu_usage
Tags (indexed): { host: "server-01", region: "us-east" }
Fields (values): { usage_percent: 72.3, load_avg: 1.4 }
Timestamp: 1709568000000000000 (nanoseconds)

Series = unique combination of measurement + tags
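
The series identity above can be sketched in a few lines. This is a minimal illustration, not any particular database's format: the `Point` class, the `series_key` helper, and the comma-separated key layout are all assumptions made for the example. Tags are sorted by key so that the same tag set always yields the same series, whatever order the client sent them in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One sample in the measurement/tags/fields/timestamp model (hypothetical type)."""
    measurement: str
    tags: dict
    fields: dict
    timestamp_ns: int


def series_key(measurement: str, tags: dict) -> str:
    """Canonical series identity: measurement plus tags sorted by tag key."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{measurement},{tag_part}" if tag_part else measurement


point = Point(
    measurement="cpu_usage",
    tags={"region": "us-east", "host": "server-01"},
    fields={"usage_percent": 72.3, "load_avg": 1.4},
    timestamp_ns=1709568000000000000,
)
key = series_key(point.measurement, point.tags)
print(key)
```

Fields are deliberately left out of the key: adding a new field value does not create a new series, while adding a new tag value does, which is why tag cardinality drives index size.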

Storage Design

A time-ordered LSM variant (the TSM naming follows InfluxDB's storage engine):

WAL → MemTable (current time window)
           ↓ (flush when full)
      TSM Files (Time-Structured Merge)
           ↓ (periodic compaction)
      Compressed TSM Files
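
The write path in the diagram can be approximated with a toy in-memory model. This is a sketch under heavy assumptions: `TinyTSStore` and its methods are hypothetical names, the WAL is a Python list standing in for an append-only file, and "segments" are plain dicts rather than real TSM files.

```python
import json


class TinyTSStore:
    """Toy write path: WAL append, in-memory buffer, flush to sorted immutable segments."""

    def __init__(self, memtable_limit: int = 4):
        self.wal = []          # stand-in for an append-only log file on disk
        self.memtable = {}     # series key -> list of (timestamp, value)
        self.memtable_size = 0
        self.memtable_limit = memtable_limit
        self.segments = []     # stand-in for immutable TSM-style files

    def write(self, series: str, ts: int, value: float) -> None:
        self.wal.append(json.dumps([series, ts, value]))  # log first, for durability
        self.memtable.setdefault(series, []).append((ts, value))
        self.memtable_size += 1
        if self.memtable_size >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Turn the buffer into one segment with each series sorted by time."""
        segment = {s: sorted(pts) for s, pts in self.memtable.items()}
        self.segments.append(segment)
        self.memtable, self.memtable_size = {}, 0
        self.wal.clear()  # flushed points no longer need the log

    def compact(self) -> None:
        """Merge all segments into one, keeping each series time-ordered."""
        if not self.segments:
            return
        merged = {}
        for segment in self.segments:
            for s, pts in segment.items():
                merged.setdefault(s, []).extend(pts)
        self.segments = [{s: sorted(pts) for s, pts in merged.items()}]
```

A real engine would also compress each segment during flush and compaction, and would replay the WAL into the memtable after a crash; both are omitted here to keep the flow visible.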

Compression

Timestamps:  Delta-of-delta encoding (differences of differences)
  [1000, 1005, 1010, 1015] → deltas [5, 5, 5] → stored as [1000, 5, 0, 0] → very compressible
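
The timestamp example above can be reproduced with a short encoder and decoder. This sketch only computes the integer stream; a real implementation would then pack those small integers into variable-length bit fields, which is where the space saving actually comes from. The function names are illustrative.

```python
def delta_of_delta_encode(timestamps):
    """Return [first_timestamp, first_delta, delta_of_delta, ...]."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_ts, prev_delta = timestamps[0], 0
    for ts in timestamps[1:]:
        delta = ts - prev_ts
        out.append(delta - prev_delta)  # change in spacing, not the spacing itself
        prev_ts, prev_delta = ts, delta
    return out


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


regular = [1000, 1005, 1010, 1015]
print(delta_of_delta_encode(regular))                    # [1000, 5, 0, 0]
print(delta_of_delta_encode([1000, 1005, 1011, 1015]))   # [1000, 5, 1, -2]
```

With a fixed scrape interval nearly every entry after the second is zero, and small jitter produces small values around zero, so both cases need very few bits per timestamp.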

Floats:      XOR encoding (Gorilla algorithm)
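  Similar consecutive values share their sign, exponent, and high mantissa bits,
  so the XOR of neighbors is mostly zeros and only a short run of bits is stored
  → Excellent compression for slowly changing metrics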
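
To see why XOR helps, the sketch below XORs the 64-bit patterns of consecutive floats and measures how many bits actually carry information. It stops short of the full Gorilla bit packing (control bits, reuse of the previous window), and the helper names are assumptions for illustration.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """First value raw, then each value XORed with its predecessor."""
    if not values:
        return []
    bits = [float_to_bits(v) for v in values]
    return [bits[0]] + [cur ^ prev for prev, cur in zip(bits, bits[1:])]


def meaningful_window(x: int):
    """Return (leading_zeros, trailing_zeros, meaningful_bits) for a 64-bit XOR result."""
    if x == 0:
        return (64, 0, 0)
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return (leading, trailing, 64 - leading - trailing)


stream = xor_stream([72.0, 72.0, 72.5])
print(meaningful_window(stream[1]))  # identical value: XOR is zero
print(meaningful_window(stream[2]))  # 72.0 -> 72.5 flips a single mantissa bit
```

An identical value XORs to zero and can be stored as a single bit; a nearby value leaves a short window of meaningful bits between long runs of leading and trailing zeros, and only that window (plus its position) needs to be written.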

Integers:    Simple8b (pack multiple small ints into 64 bits)

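Result: often 10x or better compression vs raw values; the ratio depends heavily on how regular the timestamps and how stable the values are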

Time-Based Partitioning (Shards)

Shard by time range:
  Shard 1: 2026-01-01 to 2026-01-07
  Shard 2: 2026-01-08 to 2026-01-14
  ...

Benefits:
  Drop entire shard for retention (instant delete)
  Query only relevant time shards
  Hot shard: latest data (fast writes)
  Cold shards: compressed, read-only
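
Shard routing and retention reduce to simple arithmetic on timestamps. The sketch below assumes 7-day shards aligned to 2026-01-01 UTC to match the example above; `SHARD_DURATION`, `EPOCH`, and the function names are illustrative choices, not a real system's configuration.

```python
from datetime import datetime, timedelta, timezone

SHARD_DURATION = timedelta(days=7)
EPOCH = datetime(2026, 1, 1, tzinfo=timezone.utc)  # assumed alignment origin


def shard_start(ts: datetime) -> datetime:
    """Start of the shard that owns this timestamp."""
    index = (ts - EPOCH) // SHARD_DURATION  # floor division yields an int
    return EPOCH + index * SHARD_DURATION


def shards_for_range(start: datetime, end: datetime):
    """Starts of every shard overlapping [start, end]; a query touches only these."""
    out = []
    cur = shard_start(start)
    while cur <= end:
        out.append(cur)
        cur += SHARD_DURATION
    return out


def expired_shards(shard_starts, now: datetime, retention: timedelta):
    """Shards whose newest possible point is older than the retention cutoff."""
    cutoff = now - retention
    return [s for s in shard_starts if s + SHARD_DURATION <= cutoff]
```

Retention then becomes a matter of deleting whole shard files returned by `expired_shards`, with no per-row deletes or tombstones.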

Downsampling

Raw data (1s resolution):
  86,400 points per series per day, too costly to keep long term

Continuous aggregation:
  1s data → Keep 7 days
  1m averages (from 1s) → Keep 90 days
  1h averages (from 1m) → Keep 2 years
  1d averages (from 1h) → Keep forever

Query engine: Can route each query to the appropriate resolution based on its time range
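
A rollup step and a resolution picker for the tiers above might look like the following. It is a sketch: timestamps are plain integer seconds, `TIERS` hard-codes the retention table from the text (with 2 years approximated as 730 days), and `pick_resolution` uses the simple rule "finest tier that still holds data of this age", which is an assumption rather than how any specific engine decides.

```python
from collections import defaultdict
from statistics import fmean


def downsample_mean(points, bucket_seconds):
    """Average (timestamp_s, value) points into fixed-width buckets keyed by bucket start."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, fmean(vals)) for start, vals in sorted(buckets.items())]


DAY = 86400
# (bucket width in seconds, retention in seconds); None means keep forever
TIERS = [(1, 7 * DAY), (60, 90 * DAY), (3600, 730 * DAY), (DAY, None)]


def pick_resolution(data_age_seconds):
    """Finest bucket width whose retention still covers data this old."""
    for width, retention in TIERS:
        if retention is None or data_age_seconds <= retention:
            return width


raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
print(downsample_mean(raw, 60))   # [(0, 2.0), (60, 5.0)]
print(pick_resolution(30 * DAY))  # 60
```

One caveat when chaining tiers: an average of averages equals the true average only if every bucket holds the same number of points. Storing a sum and a count per bucket, and dividing at query time, keeps the coarser tiers exact.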

Query Examples

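-- InfluxQL
SELECT mean(usage_percent)
FROM cpu_usage
WHERE host = 'server-01'
  AND time > now() - 24h
GROUP BY time(5m) fill(previous)

-- fill() is part of the GROUP BY clause and sets how empty intervals are reported:
-- fill(null) (the default), fill(none), fill(previous), fill(linear), or a constant such as fill(0)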
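
What the query's time bucketing and gap filling compute can be sketched in plain Python. This is an illustration of the semantics only, not InfluxDB's implementation: `group_by_time_mean` is a hypothetical helper, timestamps are integer seconds, and buckets are aligned to the query start for simplicity.

```python
def group_by_time_mean(points, start, end, interval, fill="null"):
    """Mean per bucket over [start, end); empty buckets are None or carry the previous mean."""
    sums = {}
    for ts, value in points:
        if start <= ts < end:
            bucket = start + (ts - start) // interval * interval
            total, count = sums.get(bucket, (0.0, 0))
            sums[bucket] = (total + value, count + 1)

    rows, previous = [], None
    for bucket in range(start, end, interval):
        if bucket in sums:
            total, count = sums[bucket]
            previous = total / count
            rows.append((bucket, previous))
        else:
            rows.append((bucket, previous if fill == "previous" else None))
    return rows


samples = [(0, 70.0), (100, 74.0), (650, 80.0)]
print(group_by_time_mean(samples, 0, 900, 300, fill="previous"))
print(group_by_time_mean(samples, 0, 900, 300, fill="null"))
```

The middle bucket has no samples, so it is reported as `None` with null fill and repeats the preceding bucket's mean with previous fill.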

Conclusion

TSDBs achieve their performance through time-partitioned storage, delta/XOR compression, continuous downsampling, and time-range-aware indexing. InfluxDB, TimescaleDB, and Prometheus each apply these principles to varying degrees.
