Presentation

Real-time Data Ingestion from Kafka to ClickHouse with Deterministic Re-tries

« Kafka Summit Americas 2021

In a real-time data ingestion pipeline for analytical processing, loading data efficiently into a columnar database such as ClickHouse favors large blocks over individual rows. Applications therefore often rely on a buffering mechanism such as Kafka to store data temporarily, and on a message processing engine that aggregates Kafka messages into large blocks, which are then loaded into the backend database. Because various failures can occur in this pipeline, a naive block aggregator that forms blocks without additional measures would cause data duplication or data loss. We have developed a solution that avoids these issues, thereby achieving exactly-once delivery from Kafka to ClickHouse. Our solution uses Kafka's metadata to keep track of the blocks that we intend to send to ClickHouse, and later uses this metadata to deterministically reproduce the same ClickHouse blocks when retrying after a failure. The identical blocks are guaranteed to be deduplicated by ClickHouse. We have also developed a run-time verification tool that monitors Kafka's internal metadata topic and raises alerts when the invariants required for exactly-once delivery are violated. Our solution has been developed and deployed to production clusters that span multiple datacenters at eBay.
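
The retry idea described in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the presenters' implementation: `BlockMetadata`, `DedupSink`, `build_block`, and `load_block` are hypothetical names, a Python list stands in for a Kafka partition, another list stands in for the stored metadata, and a hash-based sink stands in for a database that drops a block identical to one it has already accepted.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BlockMetadata:
    """Hypothetical record of the offset range a block was built from."""
    begin: int  # first offset in the block (inclusive)
    end: int    # last offset in the block (inclusive)


@dataclass
class DedupSink:
    """Stand-in for a table that ignores a block it has already accepted."""
    seen: set = field(default_factory=set)
    rows: list = field(default_factory=list)

    def insert(self, block):
        h = hashlib.sha256()
        for row in block:
            data = row.encode()
            # Length-prefix each row so row boundaries affect the hash.
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
        digest = h.hexdigest()
        if digest in self.seen:
            return False  # identical block: deduplicated
        self.seen.add(digest)
        self.rows.extend(block)
        return True


def build_block(log, meta):
    """Form a block purely from the recorded offsets, so it is reproducible."""
    return log[meta.begin:meta.end + 1]


def load_block(log, meta_store, sink, meta, crash_before_ack=False):
    """Record the intended block first, then insert it; retry on a crash."""
    meta_store.append(meta)                      # 1. persist intent
    results = [sink.insert(build_block(log, meta))]  # 2. first attempt
    if crash_before_ack:
        # The insert landed but the acknowledgement was lost. The retry
        # rebuilds the block from the stored metadata, not from whatever
        # messages happen to be buffered at that moment.
        replay = meta_store[-1]
        results.append(sink.insert(build_block(log, replay)))
    return results


log = ["m0", "m1", "m2", "m3", "m4"]  # messages at offsets 0..4
meta_store, sink = [], DedupSink()
first = load_block(log, meta_store, sink, BlockMetadata(0, 2), crash_before_ack=True)
second = load_block(log, meta_store, sink, BlockMetadata(3, 4))
```

Because the retried block is rebuilt from the recorded offset range, it is byte-for-byte the same as the first attempt, which is what allows the sink to recognize and drop it. A naive aggregator that re-formed the block from the messages available at retry time could produce a different block, and the same rows could then be loaded twice.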

Presenter

Jun Li

eBay

Jun Li is currently a Principal Architect at eBay. Over the last four years, he has been working on GraphDB and Columnar Store, ensuring the high performance, scalability, and availability of these database engines in cloud-based environments.

Before joining eBay in 2017, Jun spent 16 years at Hewlett Packard Labs in Palo Alto, focusing on large-scale distributed processing systems. His work covered innovative architectural and application features, scalable and high-performance run-time execution and monitoring, optimized algorithms that run efficiently at scale, and system security guarantees.

Jun has been granted over 20 patents and has another 30 pending for work done over the last 20 years. He received his Ph.D. in Computer Engineering from Carnegie Mellon University in 2000.