Designing a Data Lakehouse Architecture

Combine the best of data lakes and data warehouses — covering Delta Lake, Apache Iceberg, ACID on object storage, and unified analytics.

A lakehouse combines the cheap storage of a data lake with the performance and ACID guarantees of a data warehouse.

Data Architecture Evolution

Data Warehouse (2000s):
  Structured data, fast queries
  Expensive, proprietary
  Poor ML support

Data Lake (2010s):
  All data types, cheap storage (S3)
  No ACID, poor query performance
  "Data swamp" — hard to trust data

Data Lakehouse (2020s):
  Best of both: cheap + ACID + performance
  Open formats, ML-native

Core Challenge: ACID on Object Storage

S3 lacks:
  Transactions (no atomic multi-file updates)
  Versioning with isolation
  Schema enforcement
  File-level ACID

Solution: Metadata layer on top of S3 (Delta Lake / Iceberg)

Delta Lake

Structure:
  S3: data files (Parquet) + _delta_log/ (transaction log)

_delta_log/:
  00000000000000000000.json  → Initial commit
  00000000000000000001.json  → Add files
  00000000000000000002.json  → Delete files
  ...

ACID operations:
  Write → Add new Parquet files + commit to log (atomic)
  Delete → Mark files as removed in log
  Update → Write new files + remove old (copy-on-write)
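The commit protocol above can be sketched in a few lines of plain Python. This is a hypothetical, minimal model of a Delta-style log (not the real Delta Lake format): each commit is a single JSON file, and a change only becomes visible once its commit file exists, which is what makes multi-file writes atomic.

```python
import json
import tempfile
from pathlib import Path

def commit(log_dir: Path, version: int, adds, removes):
    """Record one atomic commit: data files added and data files removed."""
    entry = {"add": list(adds), "remove": list(removes)}
    (log_dir / f"{version:020d}.json").write_text(json.dumps(entry))

def live_files(log_dir: Path, as_of_version=None):
    """Replay commits in order to compute the current set of data files."""
    files = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        version = int(commit_file.stem)
        if as_of_version is not None and version > as_of_version:
            break
        entry = json.loads(commit_file.read_text())
        files |= set(entry["add"])
        files -= set(entry["remove"])
    return files

log_dir = Path(tempfile.mkdtemp())
commit(log_dir, 0, adds=["part-000.parquet"], removes=[])   # initial commit
commit(log_dir, 1, adds=["part-001.parquet"], removes=[])   # append
# Copy-on-write update: the new file replaces the old one in one commit,
# so readers never observe a half-applied update.
commit(log_dir, 2, adds=["part-002.parquet"], removes=["part-000.parquet"])

print(sorted(live_files(log_dir)))
```

Replaying the log with an earlier `as_of_version` reconstructs an older table state, which is the mechanism behind time travel.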

Apache Iceberg

Iceberg table format:
  Catalog → Table Metadata
             ↓
         Manifest List (snapshot)
             ↓
         Manifest Files (track data files)
             ↓
         Data Files (Parquet on S3)

Supports: Time travel, schema evolution, partition evolution
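The metadata hierarchy above can be walked top-down when planning a scan. The sketch below is a hypothetical in-memory model; real Iceberg persists each level as Avro/JSON files on object storage, and the field names here are illustrative, not the actual spec fields.

```python
# Catalog -> table metadata -> snapshot (manifest list) -> manifests -> data files
catalog = {
    "sales": {
        "current_snapshot": 2,
        "snapshots": {
            1: {"manifests": ["m1"]},          # older snapshot
            2: {"manifests": ["m1", "m2"]},    # current snapshot
        },
    }
}

manifests = {
    # Each manifest file tracks a set of data files (plus stats, in real Iceberg)
    "m1": ["s3://bucket/sales/a.parquet"],
    "m2": ["s3://bucket/sales/b.parquet"],
}

def plan_scan(table_name, snapshot_id=None):
    """Resolve which data files a query must read, optionally at a past snapshot."""
    meta = catalog[table_name]
    snap = meta["snapshots"][snapshot_id or meta["current_snapshot"]]
    return [f for m in snap["manifests"] for f in manifests[m]]

print(plan_scan("sales"))                    # current snapshot: both files
print(plan_scan("sales", snapshot_id=1))     # time travel: only the first file
```

Because snapshots only point to manifests, schema and partition evolution amount to writing new metadata, not rewriting data files.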

Time Travel

-- Query data as of a specific time (Iceberg)
SELECT * FROM sales
FOR SYSTEM_TIME AS OF '2026-01-01 00:00:00';

-- Delta Lake
SELECT * FROM delta.`s3://bucket/sales`
VERSION AS OF 42;
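Under the hood, an `AS OF <timestamp>` query has to map the timestamp to a concrete snapshot. A plausible sketch, assuming the engine keeps a list of commit timestamps (as both formats do in their metadata): pick the latest commit at or before the requested time.

```python
from datetime import datetime

# Hypothetical (version, commit timestamp) history for a table.
commits = [
    (0, datetime(2025, 12, 1)),
    (1, datetime(2025, 12, 15)),
    (2, datetime(2026, 1, 10)),
]

def version_as_of(ts):
    """Resolve a timestamp to the latest version committed at or before it."""
    eligible = [v for v, t in commits if t <= ts]
    if not eligible:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(eligible)

print(version_as_of(datetime(2026, 1, 1)))   # the snapshot live on New Year's Day
```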

Lakehouse Architecture

Sources: Kafka, DBs, Files
    ↓
Bronze Layer: Raw, unprocessed (Delta/Iceberg)
    ↓
Silver Layer: Cleaned, validated
    ↓
Gold Layer: Aggregated, business-ready
    ↓
  BI Tools + ML Training (same data, no copies!)
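The layer flow above is just a chain of transformations over the same store. A minimal sketch (plain Python lists standing in for Delta/Iceberg tables; in practice each step would be a Spark or Flink job writing a new table version):

```python
# Bronze: raw events exactly as ingested, including bad records.
raw_events = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "bad-value"},   # fails validation below
    {"user": "a", "amount": "4.5"},
]

def to_silver(bronze):
    """Silver: cleaned and validated -- drop rows whose amount won't parse."""
    silver = []
    for row in bronze:
        try:
            silver.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            pass   # a real pipeline would quarantine these for inspection
    return silver

def to_gold(silver):
    """Gold: business-ready aggregate -- total spend per user."""
    totals = {}
    for row in silver:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

gold = to_gold(to_silver(raw_events))
print(gold)
```

The point of the lakehouse is that BI dashboards query the gold tables while ML training reads the silver tables directly, with no export step in between.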

Engines on Lakehouse

Batch:    Spark, Trino, Presto
Stream:   Flink (writes to Delta/Iceberg)
SQL:      Athena, BigQuery Omni
ML:       PyTorch/TensorFlow read directly

Conclusion

The lakehouse eliminates the painful ETL between data lake and warehouse. Delta Lake and Iceberg bring ACID, time travel, and schema evolution to plain S3 storage.