Designing a Metrics and Monitoring System

Build a system like Prometheus + Grafana — covering time-series storage, metric collection, alerting, and dashboarding.

srikanthtelkalapally888@gmail.com

A metrics system collects, stores, and visualizes system health and performance data.

Requirements

  • Collect metrics from 1000+ services
  • Sub-minute granularity
  • Query metrics over time
  • Alerting on anomalies
  • Dashboards
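To get a feel for what these requirements imply, here is a rough back-of-the-envelope sketch. The service count, the 15-second scrape interval, and the 15-day retention come from this article; the 500 series per service and 1.5 bytes per compressed sample are assumptions chosen only for illustration, and `estimate_ingest` is a hypothetical helper.

```python
def estimate_ingest(services, series_per_service, scrape_interval_s,
                    retention_days, bytes_per_sample):
    """Return (active_series, samples_per_second, retained_bytes)."""
    active_series = services * series_per_service
    samples_per_second = active_series / scrape_interval_s
    retained_samples = samples_per_second * 86_400 * retention_days
    return active_series, samples_per_second, retained_samples * bytes_per_sample


# 1000 services, 15 s scrapes, and 15 days of retention are the article's numbers.
# 500 series per service and 1.5 bytes per sample are ASSUMPTIONS.
series, per_second, retained = estimate_ingest(1000, 500, 15, 15, 1.5)
print(f"{series} series, {per_second:,.0f} samples/s, {retained / 1e9:.1f} GB retained")
```

Under these assumptions a single node holds roughly half a million active series, which is why label cardinality tends to matter more than the raw number of services.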

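Metric Types

Counter:   Monotonically increasing; resets only when the process restarts (requests_total)
Gauge:     Can go up or down (memory_usage_bytes)
Histogram: Observations counted into cumulative buckets; can be aggregated across instances (request_duration_seconds)
Summary:   Quantiles pre-calculated on the client; cannot be aggregated across instances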
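The differences between these types are easiest to see in code. The following is a minimal, standard-library-only sketch of a counter, a gauge, and a histogram; it is not the API of any real client library, and the class and method names are chosen for illustration. A summary is omitted because it needs a streaming quantile estimator.

```python
import bisect
import math


class Counter:
    """Monotonically increasing value; negative increments are rejected."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Value that can be set to anything at any time."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Per-bucket counts plus a running sum and count of observations."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Pick the first bucket whose upper bound is >= value ("le" semantics).
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return [(upper_bound, cumulative_count)], the shape of *_bucket series."""
        total, out = 0, []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            total += n
            out.append((bound, total))
        return out


requests_total = Counter()
requests_total.inc()
requests_total.inc(4)

latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)

print(requests_total.value)
print(latency.cumulative())
```

Because the histogram only stores bucket counts, counts from many instances can be added together before a quantile is estimated, which is not possible with pre-calculated summary quantiles.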

Architecture

Services (expose /metrics endpoint)
         ↓ (scraped every 15s)
     Prometheus ──→ Alertmanager (alert routing and notification)
         ↓
  Long-term Storage
  (Thanos / Cortex)
         ↓
  Grafana (visualization)
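The scrape step in the diagram amounts to fetching a text payload from each service and turning it into samples. Below is a simplified sketch of that parsing step. It assumes a reduced form of the text exposition format: no escaped quotes inside label values, and lines carrying an explicit timestamp are skipped. `parse_exposition` is a hypothetical helper, not part of Prometheus.

```python
import re

# metric_name, optional {labels}, then a single value token
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse a simplified /metrics payload into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{service="order",status="200"} 105
memory_usage_bytes 5.2e+07
"""

for name, labels, value in parse_exposition(payload):
    print(name, labels, value)
```

In a pull model like this, the collector attaches the scrape timestamp itself, so services only need to expose their current values.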

Time-Series Storage

Data model: metric_name{labels} → [(timestamp, value)]

http_requests_total{service="order", status="200"} → [(t1, 100), (t2, 105)]

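Prometheus uses a custom TSDB that groups each series' samples into compressed chunks. Timestamps are stored as delta-of-delta values and sample values are XOR-encoded, so a regularly scraped, slowly changing series typically costs only a byte or two per sample.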
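The data model and the idea behind chunk compression can be sketched in a few lines. This is an illustrative in-memory model, not Prometheus's on-disk format: `InMemoryTSDB` and `delta_of_delta` are hypothetical names, and the encoder only shows why regularly spaced timestamps compress well (most encoded values become zero or very small).

```python
from collections import defaultdict


def series_key(name, labels):
    """Identity of a series: metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


class InMemoryTSDB:
    """Maps each series key to an append-only list of (timestamp, value)."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self._series[series_key(name, labels)].append((timestamp, value))

    def query(self, name, labels, start, end):
        samples = self._series.get(series_key(name, labels), [])
        return [(t, v) for t, v in samples if start <= t <= end]


def delta_of_delta(timestamps):
    """Encode as: first timestamp, first delta, then changes between deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    encoded = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = encoded[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded


db = InMemoryTSDB()
labels = {"service": "order", "status": "200"}
db.append("http_requests_total", labels, 1000, 100)
db.append("http_requests_total", labels, 1015, 105)

print(db.query("http_requests_total", labels, 0, 2000))
print(delta_of_delta([1000, 1015, 1030, 1046, 1061]))  # [1000, 15, 0, 1, -1]
```

Note that every distinct label combination is its own series, so a label with many possible values multiplies the number of series the store must track.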

PromQL Example

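# Request rate per service
sum by (service) (rate(http_requests_total[5m]))

# 99th percentile latency (buckets are aggregated by le before the quantile is estimated)
histogram_quantile(0.99, sum by (le) (rate(http_duration_seconds_bucket[5m])))

# Error rate per service (share of non-2xx responses)
sum by (service) (rate(http_requests_total{status!~"2.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))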
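To make the two core functions less of a black box, here is a simplified Python sketch of what `rate()` and `histogram_quantile()` compute. It is an approximation under stated assumptions: the real `rate()` also extrapolates to the edges of the range window, which is left out here, and `simple_rate` is a hypothetical name.

```python
import math


def simple_rate(samples):
    """Per-second increase over [(timestamp_s, value)], tolerating counter resets."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count cur in full.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def histogram_quantile(q, buckets):
    """Estimate a quantile from [(upper_bound, cumulative_count)] ending with +Inf.

    Uses linear interpolation inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound  # cannot interpolate; fall back to the lower bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Counter samples 15 s apart, with a restart between the third and fourth sample.
print(simple_rate([(0, 100), (15, 105), (30, 110), (45, 3), (60, 8)]))

# Cumulative latency buckets: 50 requests <= 0.1 s, 90 <= 0.5 s, 99 <= 1.0 s, 100 total.
print(histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]))
```

The interpolation explains why histogram quantiles are estimates: accuracy depends on how closely the bucket boundaries bracket the latencies you care about.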

Alerting Rules

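groups:
  - name: http-alerts   # hypothetical group name
    rules:
      - alert: HighErrorRate
        # error_rate is assumed to be a recording rule that stores the error-rate query above
        expr: error_rate > 0.05
        for: 5m
        annotations:
          summary: "Error rate above 5%"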
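The `for: 5m` clause means the condition must hold continuously for five minutes before the alert fires, which filters out short spikes. A minimal sketch of that state machine follows; the `AlertRule` class and its state names are illustrative and not Prometheus internals.

```python
INACTIVE, PENDING, FIRING = "inactive", "pending", "firing"


class AlertRule:
    """Fires only after the condition has held for `for_seconds` without a gap."""

    def __init__(self, name, threshold, for_seconds):
        self.name = name
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.active_since = None
        self.state = INACTIVE

    def evaluate(self, value, now):
        """Update the state from one evaluation of the expression at time `now`."""
        if value > self.threshold:
            if self.active_since is None:
                self.active_since = now  # condition just became true
            held = now - self.active_since
            self.state = FIRING if held >= self.for_seconds else PENDING
        else:
            # Any evaluation below the threshold resets the timer.
            self.active_since = None
            self.state = INACTIVE
        return self.state


rule = AlertRule("HighErrorRate", threshold=0.05, for_seconds=300)
for now, error_rate in [(0, 0.08), (150, 0.07), (300, 0.09), (315, 0.01)]:
    print(now, rule.evaluate(error_rate, now))
```

With these inputs the rule is pending at 0 s and 150 s, firing at 300 s, and inactive again at 315 s once the error rate drops below the threshold.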

Retention Strategy

Prometheus: Raw data, 15 days
Thanos: Downsampled, 1 year
Archive: Aggregated, indefinite
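Downsampling for the longer-lived tiers can be sketched as rolling raw samples up into fixed windows. The helper below is a hypothetical illustration, not Thanos code; it keeps min, max, sum, and count per window so that averages and extremes can still be derived later, and the 5-minute window is an assumption.

```python
def downsample(samples, window_s):
    """Aggregate [(timestamp_s, value)] into per-window min, max, sum, and count."""
    windows = {}
    for ts, value in samples:
        start = ts - ts % window_s  # align the sample to the start of its window
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"min": value, "max": value, "sum": value, "count": 1}
        else:
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
            agg["sum"] += value
            agg["count"] += 1
    return [(start, windows[start]) for start in sorted(windows)]


raw = [(0, 1.0), (15, 3.0), (299, 2.0), (300, 10.0)]
for start, agg in downsample(raw, window_s=300):
    print(start, agg, "avg:", agg["sum"] / agg["count"])
```

Keeping the sum and count rather than a precomputed average means windows can be merged again into coarser ones without losing accuracy.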

Conclusion

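Prometheus, Grafana, and Alertmanager form a widely adopted open-source monitoring stack: Prometheus scrapes and stores metrics, Grafana visualizes them, and Alertmanager routes alerts. When a single Prometheus server can no longer hold the retention you need, scale long-term storage with Thanos or Cortex.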
