Designing a Data Lakehouse Architecture
Combine the best of data lakes and data warehouses — covering Delta Lake, Apache Iceberg, ACID on object storage, and unified analytics.
srikanthtelkalapally888@gmail.com
A lakehouse combines the cheap storage of a data lake with the performance and ACID guarantees of a data warehouse.
Data Architecture Evolution
Data Warehouse (2000s):
Structured data, fast queries
Expensive, proprietary
Poor ML support
Data Lake (2010s):
All data types, cheap storage (S3)
No ACID, poor query performance
"Data swamp" — hard to trust data
Data Lakehouse (2020s):
Best of both: cheap + ACID + performance
Open formats, ML-native
Core Challenge: ACID on Object Storage
S3 on its own lacks:
Transactions (no atomic multi-file updates)
Snapshot isolation (readers can observe a half-finished write)
Schema enforcement
Table-level ACID (only single-object writes are atomic)
Solution: Metadata layer on top of S3 (Delta Lake / Iceberg)
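As a rough illustration of how a metadata layer can turn many file writes into one atomic step, the sketch below has two writers race to publish the same commit number. It assumes the store offers a put-if-absent primitive; the local filesystem's exclusive-create mode stands in for it here, and `try_commit` is a hypothetical helper, not an API of Delta Lake or Iceberg.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Attempt to publish `actions` as commit number `version`.

    Mode "x" fails if the file already exists. It stands in for a
    put-if-absent primitive on the object store (an assumption).
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False  # another writer claimed this version first


log_dir = tempfile.mkdtemp()

# Both writers saw an empty log, so both try to claim version 0.
writer_a = try_commit(log_dir, 0, [{"add": "part-a.parquet"}])
writer_b = try_commit(log_dir, 0, [{"add": "part-b.parquet"}])

# The loser re-reads the log and retries at the next version.
writer_b_retry = try_commit(log_dir, 1, [{"add": "part-b.parquet"}])

print(writer_a, writer_b, writer_b_retry)
```

Data files can be uploaded in any order beforehand. They only become part of the table once the single log entry that references them exists, which is what makes the multi-file change appear atomic to readers.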
Delta Lake
Structure:
S3: data files (Parquet) + _delta_log/ (transaction log)
_delta_log/:
00000000000000000000.json → Initial commit
00000000000000000001.json → Add files
00000000000000000002.json → Delete files
...
ACID operations:
Write → Add new Parquet files + commit to log (atomic)
Delete → Mark files as removed in log (files with surviving rows are rewritten first)
Update → Write new files + remove old (copy-on-write)
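The log structure above can be sketched as a replay: read the commits in order, apply each add and remove action, and the result is the set of files that make up the table. This is a simplified model, assuming each commit holds one JSON action per line with only a `path` field. Real log entries carry more metadata, and `write_commit` and `active_files` are hypothetical helper names.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    # One JSON action per line; file name is the version zero-padded to 20 digits.
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def active_files(log_dir, as_of_version=None):
    """Replay commits in order and return the set of live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


log_dir = tempfile.mkdtemp()
write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
# Update as copy-on-write: the new file goes in and the old one goes out in one commit.
write_commit(log_dir, 2, [
    {"add": {"path": "part-2.parquet"}},
    {"remove": {"path": "part-0.parquet"}},
])

latest = active_files(log_dir)
v1 = active_files(log_dir, as_of_version=1)
print(sorted(latest), sorted(v1))
```

Stopping the replay at an earlier version is also the idea behind reading an older version of the table: removed files stay on storage until they are cleaned up, so an older file set can still be read.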
Apache Iceberg
Iceberg table format:
Catalog → Table Metadata
↓
Manifest List (snapshot)
↓
Manifest Files (track data files)
↓
Data Files (Parquet on S3)
Supports: Time travel, schema evolution, partition evolution
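The metadata tree above can be modeled in a few lines to show how a reader gets from a snapshot down to data files, and how picking an older snapshot yields time travel. The class and field names below are hypothetical simplifications for illustration, not the Iceberg library's API or its on-disk format.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    data_files: list  # paths of Parquet files tracked by this manifest


@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifest_list: list  # Manifest objects that make up this snapshot


@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)

    def snapshot_as_of(self, timestamp_ms):
        """Return the latest snapshot committed at or before the given time."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not candidates:
            raise ValueError("no snapshot at or before that time")
        return max(candidates, key=lambda s: s.timestamp_ms)

    def plan_files(self, snapshot):
        # Walk manifest list -> manifests -> data files.
        return [path for m in snapshot.manifest_list for path in m.data_files]


m1 = Manifest(["s3://bucket/sales/a.parquet"])
m2 = Manifest(["s3://bucket/sales/b.parquet"])

table = TableMetadata()
table.snapshots.append(Snapshot(1, 1000, [m1]))
# A later append reuses m1 and adds m2; the old snapshot stays readable.
table.snapshots.append(Snapshot(2, 2000, [m1, m2]))

old = table.plan_files(table.snapshot_as_of(1500))
new = table.plan_files(table.snapshot_as_of(2500))
print(old, new)
```

Because a new snapshot can reuse existing manifests, a commit only writes metadata for what changed, and older snapshots remain queryable.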
Time Travel
-- Query data as of a specific time (Iceberg; exact syntax varies by engine)
SELECT * FROM sales
FOR SYSTEM_TIME AS OF '2026-01-01 00:00:00';
-- Query a specific table version (Delta Lake)
SELECT * FROM delta.`s3://bucket/sales`
VERSION AS OF 42;
Lakehouse Architecture
Sources: Kafka, DBs, Files
↓
Bronze Layer: Raw, unprocessed (Delta/Iceberg)
↓
Silver Layer: Cleaned, validated
↓
Gold Layer: Aggregated, business-ready
↓
BI Tools + ML Training (same data, no copies!)
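The layering above can be illustrated with a toy pipeline. This is a minimal in-memory sketch with made-up sample rows. In practice each layer would be a Delta or Iceberg table and the transforms would run in an engine such as Spark or Flink; `to_silver` and `to_gold` are hypothetical names.

```python
from collections import defaultdict

# Bronze: raw events exactly as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "region": "eu", "amount": "10.50"},
    {"order_id": "2", "region": "us", "amount": "oops"},   # unparseable amount
    {"order_id": "1", "region": "eu", "amount": "10.50"},  # duplicate delivery
    {"order_id": "3", "region": "eu", "amount": "4.50"},
]


def to_silver(rows):
    """Clean and validate: parse types, drop bad rows, deduplicate by order_id."""
    seen, out = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that fail validation
        if row["order_id"] in seen:
            continue  # skip duplicates
        seen.add(row["order_id"])
        out.append({
            "order_id": int(row["order_id"]),
            "region": row["region"],
            "amount": amount,
        })
    return out


def to_gold(rows):
    """Aggregate to a business-ready view: revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Keeping bronze untouched means silver and gold can be rebuilt whenever the cleaning or aggregation rules change.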
Engines on Lakehouse
Batch: Spark, Trino, Presto
Stream: Flink (writes Delta/Iceberg)
SQL: Athena, BigQuery Omni
ML: PyTorch/TensorFlow read directly
Conclusion
The lakehouse removes the need to copy data out of the lake into a separate warehouse through brittle ETL. Delta Lake and Iceberg bring ACID transactions, time travel, and schema evolution to plain object storage such as S3.