Recent years have taught us that cheap, virtually unlimited, and highly available cloud object storage alone doesn't make a solid enterprise data platform. Too many data lakes fell short of expectations and degenerated into sad data swamps.
With the Linux Foundation OSS project Delta Lake (https://github.com/delta-io), you can turn your data lake into the foundation of a data lakehouse that brings back ACID transactions, schema enforcement, upserts, efficient metadata handling, and time travel.
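To make one of those capabilities concrete: an upsert (Delta Lake's MERGE operation) updates rows whose keys already exist and inserts the rest, atomically. Here is a conceptual sketch in plain Python, not the Delta Lake API; a dict stands in for the table, and the field names are illustrative:

```python
# Conceptual sketch of an upsert (Delta Lake's MERGE semantics).
# A real lakehouse does this transactionally over Parquet files;
# here a dict keyed by primary key stands in for the table.
def upsert(table, updates, key):
    """Update rows whose key matches; insert rows that don't exist yet."""
    merged = dict(table)
    for row in updates:
        merged[row[key]] = row  # matched -> update, not matched -> insert
    return merged

target = {1: {"id": 1, "bpm": 62}, 2: {"id": 2, "bpm": 75}}
incoming = [{"id": 2, "bpm": 80}, {"id": 3, "bpm": 58}]
result = upsert(target, incoming, "id")
# id 2 is updated, id 3 is inserted, id 1 is untouched
```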
In this session, we explore how a data lakehouse works with streaming, using Apache Kafka as an example.
This talk is for data architects who are not afraid of some code and for data engineers who love open source and cloud services.
Attendees of this talk will learn:
Lakehouse architecture 101, the honest tech bits
The data lakehouse and streaming data: what's there beyond Apache Spark™ Structured Streaming?
Why the lakehouse and Apache Kafka make a great couple, and which concepts you should know to get them hitched successfully.
Streaming data with declarative data pipelines: In a live demo, I will show data ingestion, cleansing, and transformation based on a simulation of the Data Donation Project (DDP, https://corona-datenspende.de/science/en) built on the lakehouse with Apache Kafka, Apache Spark™, and Delta Live Tables (a fully managed service).
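The demo's ingest → cleanse → transform flow can be sketched in plain Python (a simulation of the pipeline stages, not the actual Delta Live Tables code; record fields and plausibility thresholds are illustrative assumptions, not the real DDP schema):

```python
# Plain-Python sketch of the demo's declarative pipeline stages
# (ingest -> cleanse -> transform). Field names and thresholds are
# illustrative assumptions, not the actual DDP schema.
raw = [  # "ingested" heart-rate readings
    {"user": "a", "bpm": 61},
    {"user": "b", "bpm": -5},   # sensor glitch, dropped in cleansing
    {"user": "a", "bpm": 97},
    {"user": "c", "bpm": 300},  # implausible, dropped in cleansing
    {"user": "b", "bpm": 72},
]

def cleanse(rows):
    """Keep only physiologically plausible heart rates."""
    return [r for r in rows if 30 <= r["bpm"] <= 220]

def transform(rows):
    """Average heart rate per user."""
    totals = {}
    for r in rows:
        n, s = totals.get(r["user"], (0, 0))
        totals[r["user"]] = (n + 1, s + r["bpm"])
    return {u: s / n for u, (n, s) in totals.items()}

avg = transform(cleanse(raw))
# user "a": (61 + 97) / 2 = 79.0; user "b": 72.0; user "c" dropped
```

In the live demo the same stages run as managed streaming tables fed from Kafka rather than an in-memory list.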
DDP is a scientific IoT experiment to detect COVID-19 outbreaks in Germany by identifying elevated heart rates correlated with infection. Half a million volunteers have already chosen to donate the heart-rate data from their fitness trackers.