
Presentation

A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi

Current 2023

The medallion architecture graduates raw data from operational systems through a series of progressively refined tables, ending in gold tables that serve analytics. Many teams want to build this architecture incrementally from streaming sources such as Kafka, but doing so is challenging with the technologies currently available on lakehouses: many of them cannot efficiently update records, or cannot process incremental data without recomputing the full dataset, which makes low-latency tables expensive to serve. Apache Hudi is a transactional data lake platform with full mutability support, including streaming upserts, and it provides a powerful incremental processing framework. Hudi powers the largest transactional data lakes in the industry and differentiates itself through fast upserts and change streams, so that only changed records are processed and served.
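As a rough illustration of the difference between recomputing everything and processing only new changes, the sketch below keeps a checkpoint of the last commit applied to a downstream table. It is a toy in plain Python, not Hudi's API: `Commit`, `full_recompute`, and `incremental_apply` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    instant: int   # hypothetical, monotonically increasing commit time
    records: dict  # record key -> latest value written by this commit


def full_recompute(commits):
    """Rebuild the downstream table by replaying every commit."""
    table = {}
    for commit in commits:
        table.update(commit.records)
    return table


def incremental_apply(table, commits, last_instant):
    """Upsert only commits newer than the checkpoint; return the new checkpoint.

    This toy filters a list; a real system would seek to the checkpoint
    instead of iterating over older commits.
    """
    for commit in commits:
        if commit.instant > last_instant:
            table.update(commit.records)
            last_instant = commit.instant
    return last_instant


commits = [
    Commit(1, {"a": 10, "b": 20}),
    Commit(2, {"b": 25}),
]

silver = {}
checkpoint = incremental_apply(silver, commits, last_instant=0)

# A new commit arrives; only it is applied on the next run.
commits.append(Commit(3, {"c": 30}))
checkpoint = incremental_apply(silver, commits, checkpoint)
```

After both runs, `silver` holds the same rows that `full_recompute(commits)` would produce, but the second run touched only the one new commit.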

To further improve upsert performance, Hudi now supports a record-level index that deterministically maps each record key to its file location, making lookups orders of magnitude faster. As a result, Hudi speeds up computationally expensive MERGE operations even further by avoiding full table scans. On the query side, Hudi now supports database-style change data capture (CDC) with before and after images, so that insert, update, and delete change records can flow from bronze to silver to gold tables.
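The idea behind a record-level index can be sketched with an in-memory dictionary from record key to file ID. This is a conceptual toy under stated assumptions, not Hudi's implementation: the file layout, `find_file_by_scan`, `build_record_index`, and `upsert` are all hypothetical names used only to contrast an index lookup with a scan over every file.

```python
def find_file_by_scan(files, key):
    """Baseline: look inside every file until the key is found.

    Returns (file_id, number_of_files_scanned).
    """
    scanned = 0
    for file_id, rows in files.items():
        scanned += 1
        if key in rows:
            return file_id, scanned
    return None, scanned


def build_record_index(files):
    """Map each record key to the single file that holds it."""
    return {key: file_id for file_id, rows in files.items() for key in rows}


def upsert(files, index, key, value, new_file_id):
    """Route a write through the index: update in place, or insert into new_file_id."""
    file_id = index.get(key)
    if file_id is None:
        # Unknown key: treat as an insert and register it in the index.
        file_id = new_file_id
        files.setdefault(file_id, {})
        index[key] = file_id
    files[file_id][key] = value
    return file_id


files = {
    "file-0": {"k1": 1, "k2": 2},
    "file-1": {"k3": 3},
    "file-2": {"k4": 4},
}
index = build_record_index(files)
```

With the index, updating `k4` is a single dictionary lookup followed by a write to `file-2`, whereas the scan baseline has to open all three files before it finds the key.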

In this talk, attendees will walk away with:

  • The current challenges of building a medallion architecture at low latency

  • How the record index and incremental updates work with Apache Hudi

  • How the new Hudi CDC feature unlocks incremental processing on the lake

  • How you can efficiently build a medallion architecture by avoiding expensive operations
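To make the role of before and after images concrete, here is a minimal sketch of change records flowing from one layer to the next. It is plain Python rather than Hudi code, and the record shape (`op`, `key`, `before`, `after`) and the helpers `apply_changes` and `update_gold_total` are assumptions made for illustration. The point is that a downstream aggregate can be adjusted from the images alone, without rescanning the upstream table.

```python
def apply_changes(table, changes):
    """Apply incoming changes to a keyed table and emit CDC records.

    Each incoming change is {"key": ..., "after": value or None}, where
    None means delete. Each emitted record carries before and after images.
    """
    emitted = []
    for change in changes:
        key, after = change["key"], change["after"]
        before = table.get(key)
        if after is None:
            if before is None:
                continue  # deleting a missing key changes nothing
            del table[key]
            op = "d"
        else:
            if before == after:
                continue  # no-op write, nothing to propagate
            table[key] = after
            op = "i" if before is None else "u"
        emitted.append({"op": op, "key": key, "before": before, "after": after})
    return emitted


def update_gold_total(gold, cdc_records):
    """Maintain a running sum by subtracting before images and adding after images."""
    for record in cdc_records:
        gold["total"] -= record["before"] or 0
        gold["total"] += record["after"] or 0


silver = {}
gold = {"total": 0}

# Batch 1: two inserts arrive from the bronze layer.
batch_1 = [{"key": "o1", "after": 100}, {"key": "o2", "after": 50}]
update_gold_total(gold, apply_changes(silver, batch_1))

# Batch 2: one update and one delete.
batch_2 = [{"key": "o1", "after": 120}, {"key": "o2", "after": None}]
update_gold_total(gold, apply_changes(silver, batch_2))
```

In this toy, the gold total is corrected for the update and the delete using only the two change records from the second batch.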
