Designing a Backup and Disaster Recovery System
Build a comprehensive backup and DR strategy — covering RPO, RTO, backup types, cross-region replication, and disaster recovery testing.
srikanthtelkalapally888@gmail.com
A DR system ensures business continuity by minimizing data loss and downtime after catastrophic failures.
Key Metrics
RPO (Recovery Point Objective):
Maximum acceptable data loss
RPO = 1 hour → Lose at most 1 hour of data
RTO (Recovery Time Objective):
Maximum acceptable downtime
RTO = 4 hours → Back online within 4 hours
More aggressive RPO/RTO = Higher cost
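The two metrics can be checked mechanically. The sketch below is illustrative only and assumes periodic backups with no continuous change capture, so the worst-case data loss equals the backup interval. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups only, a failure just before the next backup
    loses everything written since the previous one."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_recovery: timedelta, rto: timedelta) -> bool:
    """True if a measured recovery finished within the RTO."""
    return measured_recovery <= rto
```

For example, `meets_rpo(timedelta(hours=24), timedelta(hours=1))` is `False`: a daily backup alone cannot satisfy a 1-hour RPO.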
Backup Types
Full Backup: Copy everything
→ Weekly, large, slow
Incremental Backup: Only changes since the last backup of any type
→ Daily, fast, small
Differential: Changes since last FULL backup
→ Larger than an incremental, but a restore needs only the last full plus the latest differential
Continuous (CDC): Capture every change in real-time
→ Lowest RPO, highest cost
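The practical difference between incremental and differential backups shows up at restore time. The sketch below is a simplified model, not any tool's real behavior: `Backup` and `restore_chain` are hypothetical names, and days are plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int   # hypothetical day index, 0 = first backup
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups needed to restore the newest state, in apply order."""
    ordered = sorted(backups, key=lambda b: b.day)
    fulls = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup to restore from")
    base = ordered[fulls[-1]]
    chain = [base]
    for b in ordered[fulls[-1] + 1:]:
        if b.kind == "differential":
            # A differential holds every change since the full backup,
            # so it replaces anything applied after the full so far.
            chain = [base, b]
        elif b.kind == "incremental":
            # An incremental only holds changes since the previous backup,
            # so every one of them must be applied in order.
            chain.append(b)
    return chain
```

A full backup followed by three incrementals needs all four files to restore, while a full followed by three differentials needs only two.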
Database Backup Strategy
PostgreSQL:
pg_basebackup (full, weekly)
WAL archiving (continuous, to S3)
Point-in-time recovery: restore a base backup, then replay archived WAL to any point in time the archive covers
Schedule:
Sunday: Full backup (S3)
Mon-Sat: Incremental + WAL streaming (native incremental backups require PostgreSQL 17+; older versions need a tool such as pgBackRest)
Retention: 30 days local, 1 year archive (S3 Glacier)
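The schedule and retention rules above can be written as two small functions. This is a sketch under one assumption about the retention line: a backup lives in local storage for its first 30 days, then in the archive tier until it is one year old, then expires. The function names and tier labels are hypothetical.

```python
from datetime import date, timedelta

LOCAL_RETENTION = timedelta(days=30)     # from the schedule above
ARCHIVE_RETENTION = timedelta(days=365)  # "1 year archive"


def backup_kind_for(day: date) -> str:
    """Sunday gets the full backup, every other day an incremental."""
    # date.weekday(): Monday is 0, Sunday is 6
    return "full" if day.weekday() == 6 else "incremental"


def retention_tier(backup_date: date, today: date) -> str:
    """Classify a backup by age: 'local', 'archive', or 'expired'."""
    age = today - backup_date
    if age <= LOCAL_RETENTION:
        return "local"
    if age <= ARCHIVE_RETENTION:
        return "archive"
    return "expired"
```

A cleanup job could call `retention_tier` for each stored backup and move or delete it accordingly.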
3-2-1 Backup Rule
3 copies of data
2 different storage media
1 offsite (different region)
Example:
Primary DB (us-east-1)
Replica DB (us-east-1, different AZ)
S3 backup (us-west-2, cross-region)
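The rule is easy to verify against an inventory of copies. The following sketch is hypothetical: the `Copy` type and the medium labels ("block-storage", "object-storage") are assumptions made for illustration, and "offsite" is modeled as any region other than the primary one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str  # assumed label, e.g. "block-storage" or "object-storage"
    region: str


def satisfies_3_2_1(copies: list[Copy], primary_region: str) -> bool:
    """Check 3 copies, 2 distinct media, and at least 1 copy outside the primary region."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.region != primary_region for c in copies)
    return enough_copies and enough_media and has_offsite


# The example layout above, with assumed medium labels.
example = [
    Copy("primary-db", "block-storage", "us-east-1"),
    Copy("replica-db", "block-storage", "us-east-1"),
    Copy("s3-backup", "object-storage", "us-west-2"),
]
```

With this inventory, `satisfies_3_2_1(example, "us-east-1")` returns `True`. Dropping the S3 backup fails all three conditions at once.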
DR Strategies (Cost vs RTO)
Cold Standby (RTO: hours):
Backup only, no running systems
Cheapest. Restore from backup on disaster.
Warm Standby (RTO: minutes):
Minimal running infrastructure in DR region
Scale up on disaster.
Hot Standby / Active-Active (RTO: seconds):
Full replica running in DR region
Most expensive, near-instant failover.
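The cost versus RTO trade-off can be expressed as picking the cheapest strategy that still meets the target. In the sketch below the numeric RTO values and cost ranks are illustrative assumptions; the text above gives only orders of magnitude (hours, minutes, seconds).

```python
from datetime import timedelta
from typing import Optional

# (name, assumed achievable RTO, relative cost rank). Values are illustrative only.
STRATEGIES = [
    ("cold standby", timedelta(hours=4), 1),
    ("warm standby", timedelta(minutes=15), 2),
    ("hot standby", timedelta(seconds=30), 3),
]


def cheapest_strategy(rto_target: timedelta) -> Optional[str]:
    """Return the lowest-cost strategy whose assumed RTO fits the target, or None."""
    candidates = [s for s in STRATEGIES if s[1] <= rto_target]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s[2])[0]
```

Under these assumed numbers, a 1-hour target selects warm standby, while a target tighter than 30 seconds returns `None`, which signals that no listed strategy fits.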
DR Testing
Gameday exercises: Quarterly chaos drills
Automated DR test: Monthly, automated failover
Sandbox restore: Weekly, restore backup to test env
Measure actual RTO vs target: Is it really 4 hours?
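Measuring actual RTO means timing a real restore and comparing it with the target. A minimal sketch follows, assuming the restore procedure can be wrapped in a callable; `run_drill` and its result keys are hypothetical names.

```python
import time
from datetime import timedelta
from typing import Callable


def run_drill(restore: Callable[[], None], rto_target: timedelta) -> dict:
    """Time a restore procedure and compare the elapsed time with the RTO target."""
    start = time.monotonic()  # monotonic clock is unaffected by wall-clock changes
    restore()
    elapsed = timedelta(seconds=time.monotonic() - start)
    return {
        "actual_rto": elapsed,
        "target_rto": rto_target,
        "within_target": elapsed <= rto_target,
    }
```

In a real drill, `restore` would launch the sandbox restore or failover and return once the service passes its health checks. Recording the result of each run shows whether the measured RTO drifts over time.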
Conclusion
RPO and RTO requirements drive DR architecture and cost. The 3-2-1 rule protects the backups themselves, and continuous WAL archiving on top of it brings RPO close to zero. Test DR procedures regularly: an untested DR plan is not a DR plan.