Designing a Batch Processing System
Build a large-scale batch processing pipeline using Apache Spark, MapReduce, and workflow orchestrators like Airflow for ETL and analytics.
Batch processing handles large volumes of data in scheduled runs, complementing real-time stream processing.
Batch vs Stream
Batch: Process all data for yesterday → Daily report
Stream: Process each event as it arrives → Live dashboard
Batch: High latency, high throughput
Stream: Low latency, lower throughput
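The contrast above can be sketched in plain Python (an illustrative toy, not from the article): the same per-user count computed batch-style over the whole dataset at once, and stream-style by updating state as each event arrives.

```python
from collections import defaultdict

# Hypothetical event log: (user, count) pairs.
events = [("alice", 1), ("bob", 1), ("alice", 1)]

# Batch: process the entire dataset in one scheduled run.
def batch_counts(events):
    counts = defaultdict(int)
    for user, n in events:
        counts[user] += n
    return dict(counts)

# Stream: update incremental state as each event arrives.
stream_state = defaultdict(int)
def on_event(user, n):
    stream_state[user] += n

for user, n in events:
    on_event(user, n)

# Both arrive at the same totals; they differ in latency and scheduling.
assert batch_counts(events) == dict(stream_state)
```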
MapReduce Pattern
Input: 1TB of log files
Map Phase (parallel):
Worker 1: logs 1-100MB → {url: count}
Worker 2: logs 101-200MB → {url: count}
...
Shuffle: Group by key across workers
Reduce Phase:
All counts for /home → Sum → /home: 50,000
All counts for /about → Sum → /about: 12,000
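The three phases above can be sketched in a few lines of Python (a toy in-memory model, not a distributed implementation; the log data is made up):

```python
from collections import defaultdict
from itertools import groupby

# Toy input: one URL per log line.
logs = ["/home", "/about", "/home", "/home", "/about", "/home"]

# Map phase: each worker emits (url, 1) for every line in its slice.
def map_phase(chunk):
    return [(url, 1) for url in chunk]

mapped = map_phase(logs[:3]) + map_phase(logs[3:])  # two "workers"

# Shuffle: bring all values for the same key together.
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# Reduce phase: sum the counts per url.
counts = {url: sum(vals) for url, vals in grouped.items()}
print(counts)  # {'/about': 2, '/home': 4}
```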
Apache Spark
Spark improves on MapReduce by keeping intermediate data in memory and executing jobs as a DAG of stages rather than writing each step back to disk.

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet('s3://data/events/')
result = (df
    .filter(df.date == '2026-03-22')
    .groupBy('user_id')
    .agg(count('event').alias('event_count'))
    .orderBy('event_count', ascending=False)
)
# mode('overwrite') lets the job rerun without failing on an existing path
result.write.mode('overwrite').parquet('s3://output/daily-user-stats/')
Orchestration with Airflow
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.slack.operators.slack import SlackAPIPostOperator

with DAG('daily_etl', schedule='@daily') as dag:
    extract = S3ToRedshiftOperator(task_id='extract', ...)
    transform = SparkSubmitOperator(task_id='transform', ...)
    load = PythonOperator(task_id='load', ...)
    notify = SlackAPIPostOperator(task_id='notify', ...)

    extract >> transform >> load >> notify
Idempotent Batch Jobs
Partition output by date:
s3://output/date=2026-03-22/part-*.parquet
Rerun: Overwrite same partition
Result: Same output regardless of reruns
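The overwrite-same-partition pattern can be sketched with the local filesystem standing in for S3 (paths and the `write_partition` helper are illustrative, not a real API):

```python
import shutil
from pathlib import Path

def write_partition(base: Path, date: str, rows: list) -> None:
    """Idempotent write: replace the whole date partition atomically-ish."""
    part_dir = base / f"date={date}"
    if part_dir.exists():
        shutil.rmtree(part_dir)      # overwrite: drop the old partition first
    part_dir.mkdir(parents=True)
    (part_dir / "part-00000.txt").write_text("\n".join(rows))

base = Path("output")
write_partition(base, "2026-03-22", ["a", "b"])
write_partition(base, "2026-03-22", ["a", "b"])  # rerun: same output, no duplicates
```

Because the job replaces its own partition rather than appending, any number of reruns converges to the same result.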
Performance Optimization
Partitioning: Process data in parallel partitions
Broadcast joins: Small tables broadcast to all workers
Data skew: Salt key to prevent hot partitions
Predicate pushdown: Filter early, before shuffle
File format: Parquet (columnar, compressed)
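The salting idea from the list above can be shown in pure Python (a minimal sketch with made-up data, not Spark code): a hot key is spread across N salted sub-keys so no single partition receives all of its records, and the partial counts are recombined afterwards.

```python
import random
from collections import Counter

N_SALTS = 4

def salt(key):
    # Append a random suffix so one hot key maps to N_SALTS sub-keys.
    return f"{key}#{random.randrange(N_SALTS)}"

records = ["hot_user"] * 1000 + ["other"] * 10

# Stage 1: aggregate per salted key (this is what spreads the load).
salted = Counter(salt(k) for k in records)

# Stage 2: strip the salt and combine the partial counts.
final = Counter()
for salted_key, count in salted.items():
    final[salted_key.split("#")[0]] += count

assert final["hot_user"] == 1000  # totals are unchanged by salting
```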
Conclusion
Spark + Airflow is the industry standard for batch processing. Idempotent jobs, Parquet storage, and DAG orchestration enable reliable, maintainable pipelines.