Designing a Job Scheduling System
Build a distributed cron-like scheduler — covering job persistence, leader election, missed job recovery, and exactly-once execution guarantees.
srikanthtelkalapally888@gmail.com
A job scheduler executes tasks at specified times or intervals across a distributed fleet.
Requirements
- Cron-style scheduling (0 0 * * *)
- One-time scheduled jobs
- Exactly-once execution
- Job persistence (survive restarts)
- Missed job recovery
- Dashboard + audit log
Architecture
Job Registry (DB)
↓
Scheduler (Leader)
↓
Job Queue (Kafka)
↓
Worker Pool
↓
Job Results DB
Leader Election
Only ONE scheduler should fire jobs to avoid duplicates.
Multiple scheduler instances compete for a lock with a 30-second expiry:
Redis: SET scheduler:leader {instance_id} NX EX 30
Leader: Fires jobs, refreshes lock every 10s
Followers: Poll for lock, take over if leader dies
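The acquire-and-refresh cycle above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `InMemoryLockStore` is a hypothetical stand-in that mimics the `SET ... NX EX` semantics so the example runs without a Redis server, and `LeaderElector`, `set_nx_ex`, and `expire_if_owner` are names invented for this sketch. With real Redis, the "refresh only if I still own the key" step would need to be atomic (for example via a Lua script).

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryLockStore:
    """Hypothetical stand-in for Redis that mimics SET key value NX EX ttl."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expires_at)

    def _live_owner(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        self._data.pop(key, None)  # expired or absent
        return None

    def set_nx_ex(self, key: str, value: str, ttl: float) -> bool:
        """Set the key only if no live owner exists (NX), with a TTL (EX)."""
        if self._live_owner(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def expire_if_owner(self, key: str, value: str, ttl: float) -> bool:
        """Extend the TTL only if `value` still owns the key."""
        if self._live_owner(key) != value:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True


class LeaderElector:
    KEY = "scheduler:leader"

    def __init__(self, store: InMemoryLockStore, instance_id: str, ttl: float = 30.0) -> None:
        self.store = store
        self.instance_id = instance_id
        self.ttl = ttl

    def tick(self) -> bool:
        """Call periodically (e.g. every 10s). Returns True while this instance leads."""
        # Leader path: refresh our own lock.
        if self.store.expire_if_owner(self.KEY, self.instance_id, self.ttl):
            return True
        # Follower path: try to take over a free or expired lock.
        return self.store.set_nx_ex(self.KEY, self.instance_id, self.ttl)
```

Each scheduler instance would call `tick()` on its refresh interval and fire jobs only while it returns `True`. Note that a lock with an expiry narrows the window for two leaders but does not eliminate it (a paused leader can resume after its lock has expired), which is one reason the worker-side duplicate check still matters.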
Job Firing Logic
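import time
from datetime import datetime, timezone

# db, kafka and cron are placeholder clients; db.transaction() is assumed
# to open a transaction and commit on exit.
while True:
    now = datetime.now(timezone.utc)
    # Row locks taken by FOR UPDATE last until commit, so the SELECT and
    # the next_run updates must share one transaction.
    with db.transaction():
        due_jobs = db.query("""
            SELECT * FROM jobs
            WHERE next_run <= %(now)s
              AND status = 'active'
            FOR UPDATE SKIP LOCKED  -- Prevents duplicate pickup
        """, {"now": now})
        for job in due_jobs:
            kafka.publish('job_queue', job)
            job.next_run = cron.next(job.schedule, now)
            db.update(job)
    time.sleep(1)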
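A single pass of the firing loop can be made runnable with only the standard library. The sketch below makes several assumptions that are not in the design above: SQLite stands in for the job registry (it has no `FOR UPDATE SKIP LOCKED`, so this shows the single-leader case only), a plain Python list stands in for the Kafka topic, jobs use a fixed `interval_s` in seconds instead of a cron expression, and times are integer Unix timestamps. The function names are hypothetical.

```python
import sqlite3
from typing import Dict, List


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id TEXT PRIMARY KEY,
               next_run INTEGER NOT NULL,    -- Unix seconds
               interval_s INTEGER NOT NULL,  -- assumed fixed interval
               status TEXT NOT NULL
           )"""
    )


def fire_due_jobs(conn: sqlite3.Connection, queue: List[Dict], now: int) -> int:
    """Publish every active job with next_run <= now and advance its next_run."""
    fired = 0
    with conn:  # commits on success, rolls back on exception
        rows = conn.execute(
            "SELECT id, next_run, interval_s FROM jobs "
            "WHERE next_run <= ? AND status = 'active'",
            (now,),
        ).fetchall()
        for job_id, next_run, interval_s in rows:
            # The message carries the scheduled time so that workers can
            # deduplicate on (job_id, scheduled_time).
            queue.append({"job_id": job_id, "scheduled_time": next_run})
            # Advance to the first slot strictly after `now`.
            slots = (now - next_run) // interval_s + 1
            conn.execute(
                "UPDATE jobs SET next_run = ? WHERE id = ?",
                (next_run + slots * interval_s, job_id),
            )
            fired += 1
    return fired
```

Publishing to the queue and updating the database are still two separate systems here, so a crash between them can cause a job to be published again on restart. That is why the message includes `scheduled_time`: it gives the worker a stable key to detect the repeat.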
Missed Job Recovery
If the scheduler was down for 2 hours:
missed_count = floor(downtime / interval), e.g. a job that runs every 15 minutes has missed 8 runs
Policy options:
1. Skip all missed runs (most common)
2. Run once immediately (catch-up)
3. Run all missed (rarely desired)
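The three policies differ only in which of the missed slots get fired when the scheduler comes back. A minimal sketch, assuming fixed-interval jobs and hypothetical names (`MissedPolicy`, `plan_recovery`):

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Tuple


class MissedPolicy(Enum):
    SKIP = "skip"          # 1. skip all missed runs
    RUN_ONCE = "run_once"  # 2. run once immediately (catch-up)
    RUN_ALL = "run_all"    # 3. run every missed slot


def plan_recovery(
    next_run: datetime,
    interval: timedelta,
    now: datetime,
    policy: MissedPolicy,
) -> Tuple[List[datetime], datetime]:
    """Return (slots to fire right now, new next_run after recovery)."""
    if next_run > now:
        return [], next_run  # nothing was missed

    # Slots at next_run, next_run + interval, ... that are <= now.
    missed_count = (now - next_run) // interval + 1
    missed = [next_run + i * interval for i in range(missed_count)]
    new_next_run = next_run + missed_count * interval

    if policy is MissedPolicy.SKIP:
        to_fire: List[datetime] = []
    elif policy is MissedPolicy.RUN_ONCE:
        to_fire = [missed[-1]]  # only the most recent missed slot
    else:
        to_fire = missed
    return to_fire, new_next_run
```

In all three cases `next_run` is moved to the first slot in the future, so the policy affects only the catch-up work and not the ongoing cadence.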
Exactly-Once Execution
Worker picks job from Kafka
↓
Check job_executions table for this job_id + scheduled_time
Already exists? → Skip (idempotent)
↓
Execute job
↓
Insert execution record (commit)
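One caveat with check-then-execute is that two workers holding the same message can both pass the check before either inserts its record. A common variation, shown here as a sketch and not as the flow above verbatim, is to insert the execution record first and let a unique key on (job_id, scheduled_time) decide the winner. The example assumes SQLite for the `job_executions` table and a hypothetical `status` column; `run_once` is a name invented for illustration.

```python
import sqlite3
from typing import Callable


def init_executions(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_executions (
               job_id TEXT NOT NULL,
               scheduled_time TEXT NOT NULL,
               status TEXT NOT NULL,
               PRIMARY KEY (job_id, scheduled_time)
           )"""
    )


def run_once(
    conn: sqlite3.Connection,
    job_id: str,
    scheduled_time: str,
    task: Callable[[], None],
) -> bool:
    """Run `task` unless this (job_id, scheduled_time) was already claimed."""
    try:
        with conn:  # commit the claim before doing any work
            conn.execute(
                "INSERT INTO job_executions VALUES (?, ?, 'running')",
                (job_id, scheduled_time),
            )
    except sqlite3.IntegrityError:
        return False  # another delivery already claimed this run: skip

    task()

    with conn:
        conn.execute(
            "UPDATE job_executions SET status = 'done' "
            "WHERE job_id = ? AND scheduled_time = ?",
            (job_id, scheduled_time),
        )
    return True
```

The trade-off is that a worker crashing mid-task leaves a `running` row behind, so the run is not retried automatically. A real system would need a timeout or reaper for stale claims, or the job itself would have to be idempotent.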
Job Definition
{
"id": "job_123",
"name": "daily-report",
"schedule": "0 8 * * *",
"timezone": "America/New_York",
"type": "http",
"endpoint": "https://api.example.com/reports/generate",
"timeout": 300,
"retry": { "max_attempts": 3, "backoff": "exponential" }
}
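The `schedule` and `timezone` fields together determine `next_run`: the cron expression is evaluated in the job's local time and the result is stored in UTC. The sketch below handles only daily schedules of the form `M H * * *` to stay within the standard library; a full cron parser is out of scope, and `next_daily_run` is a hypothetical helper. It accepts any `tzinfo`, so `zoneinfo.ZoneInfo("America/New_York")` could be passed where the tz database is available.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def next_daily_run(schedule: str, tz: tzinfo, after: datetime) -> datetime:
    """Next firing time in UTC, strictly after `after`, for an 'M H * * *' schedule.

    `after` must be timezone-aware.
    """
    minute_s, hour_s, dom, month, dow = schedule.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily 'M H * * *' schedules")
    minute, hour = int(minute_s), int(hour_s)

    # Evaluate the schedule on the job's local wall clock.
    local_after = after.astimezone(tz)
    candidate = local_after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= local_after:
        candidate += timedelta(days=1)  # today's slot has passed

    # Store and compare in UTC.
    return candidate.astimezone(timezone.utc)


# Fixed UTC-5 offset used here so the example does not depend on tz data.
est = timezone(timedelta(hours=-5))
after = datetime(2024, 1, 15, 14, 0, tzinfo=timezone.utc)  # 09:00 local
print(next_daily_run("0 8 * * *", est, after))  # 2024-01-16 13:00:00+00:00
```

Evaluating in local time matters around daylight-saving transitions: with a real zone, "08:00 New York" maps to a different UTC hour in summer than in winter, which a fixed UTC offset stored on the job would get wrong.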
Conclusion
Distributed schedulers require leader election so that only one instance fires jobs, FOR UPDATE SKIP LOCKED for safe job pickup, explicit missed-job recovery policies, and a per-run execution record that workers check so a redelivered job is not executed twice.