Apache Kafka is commonly paired with row-oriented serialization formats such as Avro or Protobuf, which are well suited to record-by-record processing. However, as message volumes grow and record batches get larger, columnar storage, exemplified by Apache Parquet, offers significant advantages.
For larger record batches, Parquet's columnar encoding and compression excel, improving throughput and saving disk space. Adopting Parquet as Kafka's on-the-wire format, which matches the data lake's native format, transforms data lake ingestion: the ingestion application can deposit entire record batches into the data lake as raw byte buffers, rather than unwrapping records individually and writing them to Parquet files one by one. This skips the costly re-encoding and recompression steps that would otherwise be necessary. The result is fresher data and lower resource consumption, making the ingestion job leaner and faster.
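To make the idea concrete, here is a minimal sketch of such a pass-through ingestion job. This is not the KIP-1008 implementation; it assumes each Kafka message value is already a complete Parquet-encoded batch, and the topic name, output directory, and file-naming scheme are illustrative placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ParquetPassthroughIngestor {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "parka-ingest");
        props.put("enable.auto.commit", "false");
        // Values are treated as opaque byte buffers: each is assumed to be a complete Parquet batch.
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events-parquet")); // hypothetical topic
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Land the batch as-is: no per-record deserialization, no Parquet re-encoding,
                    // no recompression. The bytes on the topic are already in the lake's format.
                    Path target = Path.of("/data/lake/events",
                            record.topic() + "-" + record.partition() + "-" + record.offset() + ".parquet");
                    Files.createDirectories(target.getParent());
                    Files.write(target, record.value());
                }
                consumer.commitSync();
            }
        }
    }
}
```

A production job would write to object storage and compact small files, but the key point holds: the consumer never touches individual records or pays for encoding and compression.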
Beyond efficiency gains, ParKa capitalizes on Parquet's built-in column encryption, giving Kafka field-level encryption with minimal runtime overhead and support for all data types, simply by enabling it.
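For context, the column-level protection here comes from Parquet's modular encryption. Below is a minimal sketch, assuming the parquet-mr crypto API, hypothetical hard-coded keys (in practice supplied by a KMS), and an illustrative "ssn" column name:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.parquet.crypto.ColumnEncryptionProperties;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.metadata.ColumnPath;

public class ColumnEncryptionSketch {
    public static FileEncryptionProperties buildEncryptionProps() {
        // Hypothetical 128-bit keys; real deployments fetch these from a key-management service.
        byte[] footerKey = "0123456789012345".getBytes(StandardCharsets.UTF_8);
        byte[] columnKey = "1234567890123450".getBytes(StandardCharsets.UTF_8);

        // Encrypt only the sensitive "ssn" column with its own key.
        Map<ColumnPath, ColumnEncryptionProperties> encryptedColumns = new HashMap<>();
        encryptedColumns.put(ColumnPath.get("ssn"),
                ColumnEncryptionProperties.builder("ssn").withKey(columnKey).build());

        // The footer key protects file metadata; the column key protects that column's pages.
        return FileEncryptionProperties.builder(footerKey)
                .withEncryptedColumns(encryptedColumns)
                .build();
    }
}
```

The resulting properties would be supplied to the Parquet writer when the batch is encoded, so the encryption cost is paid once per batch and only readers holding the column key can decrypt the protected field.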
In this talk, we will delve into the motivation behind the ParKa feature, its use cases, key design considerations, benchmark results, and the progress of KIP-1008. Join us to see how combining Parquet and Kafka lets ParKa redefine data lake ingestion efficiency.