Designing a Backup and Disaster Recovery System

Build a comprehensive backup and DR strategy — covering RPO, RTO, backup types, cross-region replication, and disaster recovery testing.

srikanthtelkalapally888@gmail.com

A DR system ensures business continuity by minimizing data loss and downtime after catastrophic failures.

Key Metrics

RPO (Recovery Point Objective):
  Maximum acceptable data loss
  RPO = 1 hour → Lose at most 1 hour of data

RTO (Recovery Time Objective):
  Maximum acceptable downtime
  RTO = 4 hours → Back online within 4 hours

More aggressive RPO/RTO = Higher cost
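
To make the RPO definition concrete, here is a minimal sketch that checks whether a backup interval can meet an RPO target. It assumes the worst case is a failure just before the next backup finishes, so the exposure is the backup interval plus any upload lag. The function names and the `upload_lag` parameter are illustrative, not part of any tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         upload_lag: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the failure hits just before the next backup is safely stored."""
    return backup_interval + upload_lag


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              upload_lag: timedelta = timedelta(0)) -> bool:
    """True if the worst-case data loss stays within the RPO target."""
    return worst_case_data_loss(backup_interval, upload_lag) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    # A daily backup alone cannot satisfy a 1-hour RPO.
    print(meets_rpo(timedelta(hours=24), rpo))
    # Shipping changes every 30 minutes can.
    print(meets_rpo(timedelta(minutes=30), rpo))
```

Under this model, a daily backup on its own implies an RPO of about 24 hours, which is why tighter targets push toward continuous capture.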

Backup Types

Full Backup:        Copy everything
  → Weekly, large, slow

Incremental Backup: Only changes since last backup
  → Daily, fast, small

Differential:       Changes since last FULL backup
  → Compromise: larger than an incremental, but a restore needs only the full + latest differential

Continuous (CDC):   Capture every change in real-time
  → Lowest RPO, highest cost
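
The practical difference between these types shows up at restore time. The sketch below works out which backups must be applied, in order, to restore to the latest state. It assumes a simplified model in which each backup is a `(label, kind)` pair listed oldest first; the `restore_chain` helper and the labels are hypothetical.

```python
from typing import List, Tuple

# (label, kind) where kind is "full", "incremental", or "differential"
Backup = Tuple[str, str]


def restore_chain(backups: List[Backup]) -> List[str]:
    """Return the labels to restore, in order, to reach the latest state."""
    # Find the most recent full backup: every restore starts there.
    full_idx = None
    for i, (_, kind) in enumerate(backups):
        if kind == "full":
            full_idx = i
    if full_idx is None:
        raise ValueError("no full backup to restore from")

    # A differential holds everything since the full, so only the latest one matters.
    start = full_idx
    for i in range(full_idx + 1, len(backups)):
        if backups[i][1] == "differential":
            start = i

    chain = [backups[full_idx][0]]
    if start != full_idx:
        chain.append(backups[start][0])
    # Every incremental taken after that point must be replayed in order.
    chain += [label for label, kind in backups[start + 1:] if kind == "incremental"]
    return chain


if __name__ == "__main__":
    incrementals = [("sun", "full"), ("mon", "incremental"),
                    ("tue", "incremental"), ("wed", "incremental")]
    differentials = [("sun", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]
    print(restore_chain(incrementals))
    print(restore_chain(differentials))
```

An incremental chain needs every link since the full backup, so one damaged file breaks the restore; a differential chain needs only two files but each differential grows until the next full backup.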

Database Backup Strategy

PostgreSQL:
  pg_basebackup (full, weekly)
  WAL archiving (continuous, to S3)
  Point-in-time recovery: replay archived WAL up to a chosen target (e.g. a timestamp via recovery_target_time)

Schedule:
  Sunday:    Full backup (S3)
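  Mon-Sat:   Incremental + WAL streaming (incremental pg_basebackup requires PostgreSQL 17+; older versions rely on WAL or an external tool)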
  Retention: 30 days local, 1 year archive (S3 Glacier)
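
The schedule and retention rules above can be sketched as two small functions. This is an illustration only: the tier names (`local`, `archive`, `expired`) and function names are assumptions, and a real setup would more likely use S3 lifecycle rules or the backup tool's own retention settings.

```python
from datetime import date, timedelta

LOCAL_RETENTION = timedelta(days=30)     # kept on local storage
ARCHIVE_RETENTION = timedelta(days=365)  # kept in archive storage (e.g. S3 Glacier)


def backup_kind_for(day: date) -> str:
    """Sunday gets a full backup; every other day gets an incremental."""
    # date.weekday(): Monday is 0, Sunday is 6
    return "full" if day.weekday() == 6 else "incremental"


def retention_tier(backup_date: date, today: date) -> str:
    """Classify a backup by age: 'local', 'archive', or 'expired'."""
    age = today - backup_date
    if age <= LOCAL_RETENTION:
        return "local"
    if age <= ARCHIVE_RETENTION:
        return "archive"
    return "expired"


if __name__ == "__main__":
    print(backup_kind_for(date(2024, 1, 7)))                   # a Sunday
    print(retention_tier(date(2024, 1, 1), date(2024, 6, 1)))  # older than 30 days
```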

3-2-1 Backup Rule

3 copies of data
2 different storage media
1 offsite (different region)

Example:
  Primary DB (us-east-1)
  Replica DB (us-east-1, different AZ)
  S3 backup (us-west-2, cross-region)
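
As a rough illustration, the rule can be expressed as a check over an inventory of copies. The `Copy` class and the medium labels (`block`, `object`) are assumptions made for this sketch; the regions mirror the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str  # hypothetical label, e.g. "block" or "object"
    region: str


def satisfies_3_2_1(copies: List[Copy], primary_region: str) -> bool:
    """Check for 3+ copies, 2+ distinct media, and 1+ copy outside the primary region."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.region != primary_region for c in copies)
    return enough_copies and enough_media and has_offsite


if __name__ == "__main__":
    copies = [
        Copy("primary-db", "block", "us-east-1"),
        Copy("replica-db", "block", "us-east-1"),
        Copy("s3-backup", "object", "us-west-2"),
    ]
    print(satisfies_3_2_1(copies, primary_region="us-east-1"))
```

Running a check like this against the real inventory on a schedule would catch drift, for example a cross-region copy job that silently stopped.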

DR Strategies (Cost vs RTO)

Cold Standby (RTO: hours):
  Backup only, no running systems
  Cheapest. Restore from backup on disaster.

Warm Standby (RTO: minutes):
  Minimal running infrastructure in DR region
  Scale up on disaster.

Hot Standby / Active-Active (RTO: seconds):
  Full replica running in DR region (active-active also serves live traffic from it)
  Most expensive, near-instant failover.
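
The cost versus RTO trade-off can be sketched as a lookup that returns the cheapest tier able to meet a target. The cutoffs used here (under 1 minute, under 1 hour) are assumptions chosen to match the "seconds / minutes / hours" labels above, not fixed industry thresholds.

```python
from datetime import timedelta


def cheapest_strategy(rto_target: timedelta) -> str:
    """Pick the least expensive DR tier whose typical RTO fits the target."""
    if rto_target < timedelta(minutes=1):
        return "hot standby"   # RTO in seconds, most expensive
    if rto_target < timedelta(hours=1):
        return "warm standby"  # RTO in minutes
    return "cold standby"      # RTO in hours, cheapest


if __name__ == "__main__":
    for target in (timedelta(seconds=10), timedelta(minutes=15), timedelta(hours=4)):
        print(target, "->", cheapest_strategy(target))
```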

DR Testing

Gameday exercises: Quarterly chaos drills
Automated DR test: Monthly, automated failover
Sandbox restore:   Weekly, restore backup to test env
Measure actual RTO vs target: Is it really 4 hours?
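
Measuring actual RTO against the target can be as simple as timing the restore procedure during a drill. In this sketch, `restore` stands in for whatever script performs the sandbox restore; it, the result keys, and the injectable `clock` parameter are hypothetical names for illustration.

```python
import time
from datetime import timedelta
from typing import Callable


def run_restore_drill(restore: Callable[[], None],
                      rto_target: timedelta,
                      clock: Callable[[], float] = time.monotonic) -> dict:
    """Time one restore run and compare the elapsed time with the RTO target."""
    started = clock()
    restore()  # hypothetical: restore the latest backup into a test environment
    elapsed = timedelta(seconds=clock() - started)
    return {
        "actual_rto": elapsed,
        "target_rto": rto_target,
        "within_target": elapsed <= rto_target,
    }


if __name__ == "__main__":
    # Stand-in restore step; a real drill would invoke the actual procedure.
    result = run_restore_drill(lambda: time.sleep(0.1), timedelta(hours=4))
    print(result["actual_rto"], result["within_target"])
```

Recording the result of every drill makes it visible when restore time creeps toward the target as the dataset grows.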

Conclusion

RPO and RTO requirements drive DR architecture and cost. The 3-2-1 rule protects the backups themselves, and continuous WAL streaming on top of it brings RPO close to zero. Test DR procedures regularly: an untested DR plan is not a DR plan.
