The medallion architecture refines raw data from operational systems through a series of staged tables, ultimately serving analytics from gold tables. While there is strong demand to build this architecture incrementally from streaming sources like Kafka, doing so is challenging with the technologies currently available on lakehouses: many of them cannot efficiently update records or process incremental data without recomputing entire tables, which makes serving low-latency tables impractical. Apache Hudi is a transactional data lake platform with full mutability support, including streaming upserts, and a powerful incremental processing framework. Hudi powers some of the largest transactional data lakes in the industry, differentiated by fast upserts and change streams that process and serve only the changed records.
To further improve upsert performance, Hudi now supports a record-level index that deterministically maps a record key to its file location, locating records orders of magnitude faster than existing index options. As a result, Hudi speeds up computationally expensive MERGE operations even more by avoiding full table scans. On the query side, Hudi now supports database-style change data capture with before and after images, making it possible to chain inserts, updates, and deletes as change records flow from bronze to silver to gold tables.
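To make the record-level index idea concrete, here is a minimal conceptual sketch in plain Python. It is illustrative only, not Hudi's implementation: all names (`RecordLevelIndex`, `plan_upserts`, the file ids) are made up for this example. The point is that a key-to-file mapping lets an upsert batch be routed to just the affected files, with no table scan:

```python
# Conceptual sketch of a record-level index: record key -> file location.
# Illustrative only; this is NOT how Apache Hudi implements its index.

class RecordLevelIndex:
    def __init__(self):
        self._index = {}  # record key -> id of the file holding that record

    def tag(self, key):
        """Return the file containing the key, or None for a new insert."""
        return self._index.get(key)

    def update(self, key, file_id):
        self._index[key] = file_id


def plan_upserts(index, incoming_keys):
    """Split an incoming batch into updates (grouped by target file) and
    inserts, without scanning any data files."""
    updates, inserts = {}, []
    for key in incoming_keys:
        file_id = index.tag(key)
        if file_id is None:
            inserts.append(key)
        else:
            updates.setdefault(file_id, []).append(key)
    return updates, inserts


index = RecordLevelIndex()
index.update("order-1", "file-a")
index.update("order-2", "file-b")

updates, inserts = plan_upserts(index, ["order-1", "order-3"])
# updates == {"file-a": ["order-1"]}, inserts == ["order-3"]
```

Only the files named in `updates` need to be rewritten by the MERGE, which is what avoids the full-table scan described above.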
In this talk, attendees will walk away with:
The current challenges of building a medallion architecture at low latency
How the record index and incremental updates work with Apache Hudi
How the new Hudi CDC feature unlocks incremental processing on the lake
How you can efficiently build a medallion architecture by avoiding expensive operations
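As a rough sketch of the CDC idea the abstract describes (illustrative only, not Hudi's API or on-disk format), each change record can be modeled as an operation type plus before and after images, which a downstream silver table applies incrementally instead of recomputing from scratch:

```python
# Illustrative sketch of database-style CDC records with before/after
# images, and incremental application downstream. Not Hudi's actual API.

def diff_versions(old, new):
    """Emit CDC records (op, key, before, after) between two snapshots."""
    changes = []
    for key, after in new.items():
        before = old.get(key)
        if before is None:
            changes.append(("insert", key, None, after))
        elif before != after:
            changes.append(("update", key, before, after))
    for key, before in old.items():
        if key not in new:
            changes.append(("delete", key, before, None))
    return changes


def apply_changes(table, changes):
    """Apply CDC records to a downstream table, touching only changed keys."""
    for op, key, _before, after in changes:
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = after
    return table


bronze_v1 = {"k1": {"qty": 1}, "k2": {"qty": 5}}
bronze_v2 = {"k1": {"qty": 2}, "k3": {"qty": 7}}

changes = diff_versions(bronze_v1, bronze_v2)
silver = apply_changes(dict(bronze_v1), changes)
# silver == {"k1": {"qty": 2}, "k3": {"qty": 7}}
```

Chaining this pattern from bronze to silver to gold is what keeps each hop incremental: every stage consumes only the change records of the stage before it.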