Designing a Job Scheduling System

Build a distributed cron-like scheduler — covering job persistence, leader election, missed job recovery, and exactly-once execution guarantees.

srikanthtelkalapally888@gmail.com

A job scheduler executes tasks at specified times or intervals across a distributed fleet.

Requirements

  • Cron-style scheduling (e.g. 0 0 * * * runs daily at midnight)
  • One-time scheduled jobs
  • Exactly-once execution
  • Job persistence (survive restarts)
  • Missed job recovery
  • Dashboard + audit log

Architecture

Job Registry (DB)
      ↓
 Scheduler (Leader)
      ↓
  Job Queue (Kafka)
      ↓
  Worker Pool
      ↓
  Job Results DB

Leader Election

Only ONE scheduler should fire jobs to avoid duplicates.

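Multiple scheduler instances compete for a lock:
Redis: SET scheduler:leader {instance_id} NX EX 30

Leader: Fires jobs and refreshes the lock every 10s (well inside the 30s expiry), but only while the key still holds its own instance_id
Followers: Poll for the lock and take over once it expires after the leader dies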
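
The acquire-and-refresh cycle above can be sketched as follows. This is a minimal illustration, not a production lock: InMemoryLockStore is a hypothetical stand-in that mimics the SET ... NX EX semantics in process so the example runs without a Redis server, and try_lead, LOCK_KEY, and LOCK_TTL are names invented here. Against a real Redis, the ownership-checked refresh would need to be atomic (for example via a server-side script), which this sketch does not show.

```python
import time


class InMemoryLockStore:
    """Hypothetical stand-in for Redis that mimics SET key value NX EX ttl."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _get_live(self, key):
        """Return the entry for key, or None if it is missing or expired."""
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)
            return None
        return entry

    def set_nx_ex(self, key, value, ttl):
        """Set key only if it is absent or expired; return True on success."""
        if self._get_live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def refresh_if_owner(self, key, value, ttl):
        """Extend the expiry only if the caller still holds the lock."""
        entry = self._get_live(key)
        if entry is None or entry[0] != value:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True


LOCK_KEY = "scheduler:leader"
LOCK_TTL = 30  # seconds, matching EX 30 above


def try_lead(store, instance_id):
    """Return True if this instance is the leader after the call."""
    # A current leader refreshes; everyone else attempts to acquire.
    if store.refresh_if_owner(LOCK_KEY, instance_id, LOCK_TTL):
        return True
    return store.set_nx_ex(LOCK_KEY, instance_id, LOCK_TTL)
```

Each scheduler instance would call try_lead roughly every 10 seconds and fire jobs only while it returns True.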

Job Firing Logic

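import time
from datetime import datetime, timezone

# db, kafka, and cron are placeholder clients, not a specific library.
while True:
  now = datetime.now(timezone.utc)
  with db.transaction():  # row locks are held until the transaction commits
    due_jobs = db.query("""
      SELECT * FROM jobs
      WHERE next_run <= %(now)s
        AND status = 'active'
      FOR UPDATE SKIP LOCKED  -- Prevents duplicate pickup
    """, {"now": now})

    for job in due_jobs:
      # Send the scheduled time too, so workers can deduplicate on it.
      kafka.publish('job_queue', {"job": job, "scheduled_time": job.next_run})
      job.next_run = cron.next(job.schedule, now)
      db.update(job)

  time.sleep(1)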
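
The loop relies on a helper that computes the next run time from a cron expression. A minimal sketch of such a helper is shown below, under several assumptions: it supports only `*`, `*/N`, single numbers, and comma lists; it works on naive datetimes and ignores time zones; and it requires all five fields to match, whereas standard cron treats day-of-month and day-of-week as alternatives when both are restricted. The names parse_field and next_run are illustrative and not from any particular library.

```python
from datetime import datetime, timedelta

# (min, max) for minute, hour, day-of-month, month, day-of-week (0 = Sunday)
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def parse_field(field, lo, hi):
    """Expand one cron field into the set of allowed integer values."""
    allowed = set()
    for part in field.split(","):
        if part == "*":
            allowed.update(range(lo, hi + 1))
        elif part.startswith("*/"):
            allowed.update(range(lo, hi + 1, int(part[2:])))
        else:
            value = int(part)
            if not lo <= value <= hi:
                raise ValueError(f"{value} out of range {lo}-{hi}")
            allowed.add(value)
    return allowed


def next_run(schedule, after):
    """Return the first whole minute strictly after `after` matching `schedule`."""
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    minutes, hours, days, months, weekdays = (
        parse_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    # Brute-force search minute by minute, bounded to about four years.
    for _ in range(4 * 366 * 24 * 60):
        cron_weekday = (candidate.weekday() + 1) % 7  # Python uses Monday = 0
        if (candidate.minute in minutes and candidate.hour in hours
                and candidate.day in days and candidate.month in months
                and cron_weekday in weekdays):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError("schedule never fires")
```

The brute-force search is slow for rare schedules but keeps the matching logic easy to verify; a real scheduler would more likely use an established cron library.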

Missed Job Recovery

If the scheduler was down for 2 hours:
  missed_count per job = floor(downtime / interval), e.g. a job that runs every 30 minutes misses 4 runs
  Policy options:
    1. Skip all missed runs (most common)
    2. Run once immediately (catch-up)
    3. Run all missed (rarely desired)
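
The three policies can be expressed as a small planning function. This is a sketch for fixed-interval jobs only; plan_recovery and the policy names "skip", "run_once", and "run_all" are hypothetical labels chosen here to mirror the options above.

```python
from datetime import datetime, timedelta


def plan_recovery(last_scheduled, now, interval, policy):
    """Decide which missed runs to enqueue after an outage.

    last_scheduled: the last scheduled time that actually fired.
    Returns (runs_to_fire, next_run), where next_run is the first
    scheduled time strictly after `now`.
    """
    missed = []
    next_run = last_scheduled + interval
    while next_run <= now:
        missed.append(next_run)
        next_run += interval

    if policy == "skip":
        to_fire = []
    elif policy == "run_once":
        to_fire = missed[-1:]  # only the most recent missed slot
    elif policy == "run_all":
        to_fire = missed
    else:
        raise ValueError(f"unknown policy: {policy}")
    return to_fire, next_run
```

For an hourly job that last fired at 10:00 and a scheduler that comes back at 12:30, the missed slots are 11:00 and 12:00, and the next regular run is 13:00 under every policy.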

Exactly-Once Execution

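Worker picks job from Kafka
    ↓
Check job_executions table for this job_id + scheduled_time
Already exists? → Skip (idempotent)
    ↓
Execute job
    ↓
Insert execution record (commit)

Note: A message can be delivered more than once, and two workers can both pass the check before either inserts. The check and the insert therefore need to be atomic, for example a unique key on (job_id, scheduled_time) with the record inserted before the job runs.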
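
One way to make the duplicate check race-free is to claim the run by inserting the execution record before executing, and to let a unique key reject the second claim. The sketch below uses SQLite purely so that it runs standalone; the table layout, the status column, and the names init_db and run_once are assumptions for illustration. A worker that crashes after claiming leaves a 'running' row behind, so a real system would also need a timeout-based recovery step that is not shown here.

```python
import sqlite3


def init_db(conn):
    """Create the execution log with a unique key per scheduled run."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS job_executions (
            job_id TEXT NOT NULL,
            scheduled_time TEXT NOT NULL,
            status TEXT NOT NULL,
            PRIMARY KEY (job_id, scheduled_time)
        )
    """)


def run_once(conn, job_id, scheduled_time, task):
    """Run `task` unless this (job_id, scheduled_time) was already claimed.

    Returns True if the task ran, False if it was skipped as a duplicate.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO job_executions VALUES (?, ?, 'running')",
                (job_id, scheduled_time),
            )
    except sqlite3.IntegrityError:
        return False  # an earlier delivery already claimed this run

    task()

    with conn:
        conn.execute(
            "UPDATE job_executions SET status = 'done' "
            "WHERE job_id = ? AND scheduled_time = ?",
            (job_id, scheduled_time),
        )
    return True
```

Because the database enforces the unique key, a second delivery of the same message is skipped even if two workers race on it.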

Job Definition

{
  "id": "job_123",
  "name": "daily-report",
  "schedule": "0 8 * * *",
  "timezone": "America/New_York",
  "type": "http",
  "endpoint": "https://api.example.com/reports/generate",
  "timeout": 300,
  "retry": { "max_attempts": 3, "backoff": "exponential" }
}
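
The retry block in this definition could be interpreted along the following lines. This is a sketch under assumptions: the definition does not say what the base delay, growth factor, or cap are, so base_seconds, factor, and cap_seconds are invented defaults, and backoff_delays and run_with_retry are hypothetical helpers. Only the "exponential" setting is handled.

```python
import time


def backoff_delays(max_attempts, base_seconds=1.0, factor=2.0, cap_seconds=60.0):
    """Delays to wait before each retry; the first attempt has no delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return [
        min(cap_seconds, base_seconds * factor ** i)
        for i in range(max_attempts - 1)
    ]


def run_with_retry(task, retry, sleep=time.sleep):
    """Call `task` up to retry['max_attempts'] times with growing delays."""
    delays = backoff_delays(retry["max_attempts"])
    for attempt in range(retry["max_attempts"]):
        try:
            return task()
        except Exception:
            if attempt == len(delays):
                raise  # attempts exhausted, surface the last error
            sleep(delays[attempt])
```

With max_attempts set to 3 and these assumed defaults, a failing job is tried immediately, again after 1 second, and a final time after a further 2 seconds.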

Conclusion

Distributed schedulers require leader election so that only one instance fires jobs, FOR UPDATE SKIP LOCKED for safe job pickup, a per-run execution record so workers skip duplicates, and explicit missed-job recovery policies.
