Designing a Metrics and Monitoring System
Build a system like Prometheus + Grafana, covering time-series storage, metric collection, alerting, and dashboarding.
srikanthtelkalapally888@gmail.com
A metrics system collects, stores, and visualizes system health and performance data.
Requirements
- Collect metrics from 1000+ services
- Sub-minute granularity
- Query metrics over time
- Alerting on anomalies
- Dashboards
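Metric Types
Counter: Monotonically increasing; resets to zero only when the process restarts (requests_total)
Gauge: Can go up or down (memory_usage_bytes)
Histogram: Counts observations in configurable buckets; quantiles are estimated at query time (request_duration_seconds)
Summary: Quantiles pre-calculated in the client; cheap to query but cannot be aggregated across instances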
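To make the difference between the types concrete, here is a minimal Python sketch of a counter, a gauge, and a histogram. The class and method names are illustrative assumptions and do not mirror any particular client library; the point is only the behavior each type allows.

```python
import bisect


class Counter:
    """Value that only ever increases."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that may move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets keyed by their upper bound."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)                # upper bounds, +Inf excluded
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0                            # sum of all observed values

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list[int]:
        # Exposed buckets are cumulative: each one includes all smaller ones.
        out, running = [], 0
        for count in self.counts:
            running += count
            out.append(running)
        return out
```

For example, a histogram with bounds `[0.1, 0.5, 1.0]` that observes 0.05, 0.3, 0.3 and 2.0 reports cumulative counts `[1, 3, 3, 4]`, where the last entry is the +Inf bucket and equals the total number of observations.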
Architecture
Services (expose /metrics endpoint)
↓ (scraped every 15s)
Prometheus → Alertmanager (alert routing and notifications)
↓
Long-term Storage (Thanos / Cortex)
↓
Grafana (visualization)
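The scrape step at the top of the diagram amounts to fetching a text page and turning each line into a sample. The sketch below parses a simplified subset of the text exposition format. It assumes label values contain no escaped quotes and ignores any trailing timestamp; `parse_metrics` is a hypothetical helper, not part of Prometheus.

```python
import re

# Metric name, optional {labels}, then the value. Anything after the value
# (such as an optional timestamp) is ignored in this sketch.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # a real scraper would report a parse error here
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Given the line `http_requests_total{service="order",status="200"} 105`, the function returns one tuple with the metric name, the two labels as a dict, and the value `105.0`.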
Time-Series Storage
Data model: metric_name{labels} → [(timestamp, value)]
http_requests_total{service="order", status="200"} → [(t1, 100), (t2, 105)]
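The data model above can be sketched as a map from a series identity (metric name plus label set) to an append-only list of points. This is an in-memory illustration only; `SeriesStore` and its methods are hypothetical names, and a real TSDB adds indexing, compression, and persistence.

```python
from collections import defaultdict


class SeriesStore:
    """Maps (metric name, sorted label pairs) to a list of (timestamp, value)."""

    def __init__(self) -> None:
        self._series: dict[tuple, list[tuple[float, float]]] = defaultdict(list)

    def append(self, name: str, labels: dict[str, str], ts: float, value: float) -> None:
        # Sorting the label pairs makes the key independent of label order.
        key = (name, tuple(sorted(labels.items())))
        self._series[key].append((ts, value))

    def select(self, name: str, **matchers: str) -> dict[tuple, list[tuple[float, float]]]:
        """Return every series of `name` whose labels equal all matchers."""
        return {
            key: points
            for key, points in self._series.items()
            if key[0] == name
            and all(dict(key[1]).get(k) == v for k, v in matchers.items())
        }
```

Each distinct label combination becomes its own series, which is why high-cardinality labels (user IDs, request IDs) are expensive in this model.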
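Prometheus uses a custom TSDB. Samples for each series are grouped into chunks and compressed with Gorilla-style encoding: delta-of-delta for timestamps and XOR for values. Chunks are persisted in immutable on-disk blocks that are compacted over time.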
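One technique commonly used to compress timestamps inside a chunk is delta-of-delta encoding: with a fixed scrape interval, consecutive deltas are nearly identical, so their differences are mostly zero and need very few bits. The sketch below shows the idea on plain integers. It is a simplification, not Prometheus's actual bit-level format, and the function names are illustrative.

```python
def encode_timestamps(timestamps: list[int]) -> list[int]:
    """Return [first timestamp, first delta, delta-of-delta, ...]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)  # zero when the interval is steady
        prev_delta = delta
    return encoded


def decode_timestamps(encoded: list[int]) -> list[int]:
    """Invert encode_timestamps."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for delta_of_delta in encoded[1:]:
        delta += delta_of_delta
        timestamps.append(timestamps[-1] + delta)
    return timestamps
```

Scrapes at 1000, 1015, 1030, 1045 and 1061 seconds encode to `[1000, 15, 0, 0, 1]`: after the first two entries, only the one-second jitter on the last scrape produces a non-zero value.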
PromQL Example
# Request rate per service
sum by (service) (rate(http_requests_total[5m]))
# 99th percentile latency
histogram_quantile(0.99, sum by (le) (rate(http_duration_seconds_bucket[5m])))
# Error rate (fraction of non-2xx responses)
sum(rate(http_requests_total{status!~"2.."}[5m])) / sum(rate(http_requests_total[5m]))
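To show roughly what `rate()` and `histogram_quantile()` compute, here is a simplified Python sketch. It assumes samples arrive as sorted `(timestamp, value)` pairs and buckets as cumulative `(upper_bound, count)` pairs ending in +Inf. Unlike the real `rate()`, `simple_rate` does not extrapolate to the window boundaries, and the quantile function only accepts 0 < q <= 1.

```python
import math


def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter, tolerating counter resets."""
    if len(samples) < 2:
        return math.nan
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as growth since the reset.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q by linear interpolation inside a bucket."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.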
Alerting Rules
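# One entry under groups[].rules in a Prometheus rule file
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status!~"2.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  annotations:
    summary: "Error rate above 5% for 5 minutes"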
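The `for: 5m` clause means the condition must hold continuously before the alert fires, which filters out short spikes. The state machine behind it can be sketched as follows. `AlertState` is a hypothetical class and timestamps are plain seconds; the three state names follow Prometheus's inactive, pending, and firing states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_seconds: float                    # how long the condition must hold
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, condition: bool) -> str:
        """Return the alert state after one rule evaluation."""
        if not condition:
            self.active_since = None  # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

With `for_seconds=300`, a condition that is true at t=0 and t=60 is pending, becomes firing at t=300, and drops straight back to inactive on the first evaluation where the expression is false.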
Retention Strategy
Prometheus: Raw data, 15 days
Thanos: Downsampled, 1 year
Archive: Aggregated, indefinite
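Downsampling trades resolution for storage: raw samples are replaced by a few aggregates per fixed window. The sketch below groups samples into 5-minute windows and keeps count, sum, min, and max for each. It is similar in spirit to what a downsampling compactor does but is not Thanos's implementation; `downsample` and the window size are assumptions for illustration.

```python
from collections import defaultdict


def downsample(
    samples: list[tuple[int, float]], window_seconds: int = 300
) -> dict[int, dict[str, float]]:
    """Aggregate (timestamp, value) samples into fixed windows.

    Keys are window start times; each window keeps enough aggregates to
    answer avg (sum / count), min, and max queries at the coarser resolution.
    """
    windows: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        windows[ts - ts % window_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(windows.items())
    }
```

Keeping min and max alongside the sum matters for dashboards: an average alone would hide the spikes that raw data shows.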
Conclusion
Prometheus, Grafana, and Alertmanager form a widely adopted open-source monitoring stack. Because a single Prometheus server keeps its data on local disk with limited retention, scale long-term storage with Thanos or Cortex.