Designing a Batch Processing System

Build a large-scale batch processing pipeline using Apache Spark, MapReduce, and workflow orchestrators like Airflow for ETL and analytics.

Batch processing handles large volumes of data in scheduled runs, complementing real-time stream processing.

Batch vs Stream

Batch:  Process all data for yesterday → Daily report
Stream: Process each event as it arrives → Live dashboard

Batch: High latency, high throughput
Stream: Low latency, lower throughput

MapReduce Pattern

Input: 1TB of log files

Map Phase (parallel):
  Worker 1: logs 1-100MB → {url: count}
  Worker 2: logs 101-200MB → {url: count}
  ...

Shuffle: Group by key across workers

Reduce Phase:
  All counts for /home → Sum → /home: 50,000
  All counts for /about → Sum → /about: 12,000
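The map/shuffle/reduce flow above can be sketched in-process. This is a toy illustration, not a distributed implementation: two lists of log lines stand in for two workers, and the helper names (map_chunk, shuffle, reduce_counts) are chosen here for clarity.

```python
from collections import Counter, defaultdict

def map_chunk(lines):
    """Map phase: one worker counts URLs in its chunk of the log."""
    return Counter(line.split()[0] for line in lines)

def shuffle(partials):
    """Shuffle: group every partial count by key across workers."""
    grouped = defaultdict(list)
    for partial in partials:
        for url, count in partial.items():
            grouped[url].append(count)
    return grouped

def reduce_counts(grouped):
    """Reduce phase: sum all partial counts per key."""
    return {url: sum(counts) for url, counts in grouped.items()}

logs = ["/home 200", "/about 200", "/home 200", "/home 404"]
chunks = [logs[:2], logs[2:]]                # two "workers"
partials = [map_chunk(c) for c in chunks]    # runs in parallel in a real cluster
totals = reduce_counts(shuffle(partials))
# totals == {"/home": 3, "/about": 1}
```

In a real MapReduce job the shuffle moves data across the network so that all values for one key land on the same reducer; here the grouping dict plays that role.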

Apache Spark

Spark improves on MapReduce by keeping intermediate data in memory and executing jobs as a DAG of stages, avoiding the disk writes between every map and reduce pass.

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName('daily-user-stats').getOrCreate()

df = spark.read.parquet('s3://data/events/')

result = (df
  .filter(df.date == '2026-03-22')
  .groupBy('user_id')
  .agg(count('event').alias('event_count'))
  .orderBy('event_count', ascending=False)
)

result.write.mode('overwrite').parquet('s3://output/daily-user-stats/')

Orchestration with Airflow

from airflow import DAG
# Operator imports come from their provider packages
# (e.g. airflow.providers.apache.spark.operators.spark_submit)

with DAG('daily_etl', schedule='@daily') as dag:
  extract = S3ToRedshiftOperator(task_id='extract', ...)
  transform = SparkSubmitOperator(task_id='transform', ...)
  load = PythonOperator(task_id='load', ...)
  notify = SlackOperator(task_id='notify', ...)

  extract >> transform >> load >> notify

Idempotent Batch Jobs

Partition output by date:
  s3://output/date=2026-03-22/part-*.parquet

Rerun: Overwrite same partition
Result: Same output regardless of reruns
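The rerun-overwrite pattern can be sketched with local files standing in for the s3://output/ layout above (the directory name and record fields are illustrative):

```python
import json
import shutil
from pathlib import Path

def write_partition(base, date, records):
    """Overwrite one date partition atomically enough for reruns:
    delete the whole partition first, then rewrite it from scratch."""
    partition = Path(base) / f"date={date}"
    if partition.exists():
        shutil.rmtree(partition)       # rerun-safe: no stale files survive
    partition.mkdir(parents=True)
    (partition / "part-00000.json").write_text(
        "\n".join(json.dumps(r) for r in records)
    )
    return partition

write_partition("output", "2026-03-22", [{"user": "a", "events": 3}])
write_partition("output", "2026-03-22", [{"user": "a", "events": 3}])  # rerun
# Both runs leave the exact same files in output/date=2026-03-22/
```

In Spark the same idea is `df.write.mode('overwrite').partitionBy('date')`, so a backfill replaces only the partitions it touches.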

Performance Optimization

Partitioning: Process data in parallel partitions
Broadcast joins: Small tables broadcast to all workers
Data skew: Salt key to prevent hot partitions
Predicate pushdown: Filter early, before shuffle
File format: Parquet (columnar, compressed)
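Of these, salting is the least obvious, so here is a minimal sketch. A hot key ("user_1" below, a made-up example) is split into N sub-keys like user_1#0 .. user_1#3 so its records spread across partitions; a second aggregation strips the salt and merges the partials:

```python
import random
from collections import Counter

SALT_BUCKETS = 4

def salt(key):
    """Append a random bucket suffix so one hot key becomes N sub-keys."""
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def unsalt(salted_key):
    """Strip the bucket suffix to recover the original key."""
    return salted_key.rsplit("#", 1)[0]

events = ["user_1"] * 1000 + ["user_2"] * 10   # user_1 is the hot key

# Stage 1: aggregate on salted keys (spreads user_1 over up to 4 partitions).
partial = Counter(salt(k) for k in events)

# Stage 2: strip the salt and merge the partials back to the true key.
final = Counter()
for salted_key, count in partial.items():
    final[unsalt(salted_key)] += count
# final == Counter({"user_1": 1000, "user_2": 10})
```

The trade-off is a second shuffle-and-aggregate stage; it pays off only when one key is large enough to stall a single partition.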

Conclusion

Spark + Airflow is a de facto industry standard for batch processing. Idempotent jobs, Parquet storage, and DAG orchestration together enable reliable, maintainable pipelines.