Streaming Data Lakes using Kafka Connect + Apache Hudi

Apache Hudi is a data lake platform that provides streaming primitives (upserts, deletes, change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood, and other companies, and comes pre-installed on four major cloud platforms.
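
As a rough sketch of what these streaming primitives look like in practice, the snippet below upserts a small DataFrame into a Hudi table with Spark, following the pattern in Hudi's quickstart; the table name, record key, precombine field, and base path are illustrative placeholders, not values from this talk.

    # Minimal PySpark sketch of a Hudi upsert (assumes the Hudi Spark
    # bundle is on the classpath, e.g. via --packages on spark-submit).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hudi-upsert-sketch")
             .config("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())

    # A toy batch of records; in a real pipeline this would come from Kafka.
    df = spark.createDataFrame(
        [("id-1", "2021-11-01T00:00:00", 42)],
        ["uuid", "ts", "value"])

    hudi_options = {
        "hoodie.table.name": "demo_table",                     # illustrative
        "hoodie.datasource.write.recordkey.field": "uuid",     # dedup key
        "hoodie.datasource.write.precombine.field": "ts",      # latest wins
        "hoodie.datasource.write.operation": "upsert",
    }

    # "append" mode with the upsert operation updates existing keys in place.
    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("/tmp/hudi/demo_table"))

Because records are keyed and precombined, re-running the same write replaces rows rather than duplicating them, which is the mutability that a plain file sink cannot offer.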

Hudi supports exactly-once, near-real-time data ingestion from Apache Kafka to cloud storage, and is typically used in place of an S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini-batches, leading to lower data freshness. In this talk, we introduce a Kafka Connect sink connector for Apache Hudi, which writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services (indexing, compaction, clustering) work behind the scenes to further reorganize the data for better query performance.
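
To make the connector's shape concrete, a hypothetical Kafka Connect worker configuration for the Hudi sink might look like the following. The connector class and the Hudi-specific property names are assumptions based on the hudi-kafka-connect module's conventions and may differ by version; the topic, bucket, and table names are placeholders.

    # Hypothetical standalone-worker config for the Hudi sink connector
    name=hudi-sink
    connector.class=org.apache.hudi.connect.HoodieSinkConnector  # assumed class name
    tasks.max=2
    topics=impressions

    # Standard Kafka Connect converters
    key.converter=org.apache.kafka.connect.storage.StringConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable=false

    # Target Hudi table and storage location (illustrative values;
    # property names assumed from the connector's documentation style)
    target.base.path=s3a://my-bucket/hudi/impressions
    target.table.name=impressions
    hoodie.table.name=impressions

Conceptually, each connector task appends incoming records to Hudi's log files so they are queryable right away, and the table services described above compact and cluster those logs asynchronously.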

Presenters

Vinoth Chandar

Vinoth Chandar is the original creator and VP of the Apache Hudi project, which has changed the face of data lake architectures over the past few years. Vinoth has a keen interest in unified data storage/processing architectures. He drove various efforts around stream processing and Kafka at Confluent, and in the past built large-scale, mission-critical infrastructure systems at companies like Uber and LinkedIn.

Balaji Varadarajan

Balaji Varadarajan is a Sr. Staff Engineer at Robinhood, where he broadly oversees Robinhood's data lake. He is also an Apache Hudi PMC member. Previously, he was a tech lead on Uber's data ingestion team and one of the lead engineers on LinkedIn's Databus change capture system. Balaji's interests lie in distributed data systems.