Presentation

Mitigating Kafka Broker ‘Gray’ Failures For Key Based Partitioners With Partition Multihoming

« Current 2023

Kafka broker gray failures are a common source of incidents at New Relic. We define a gray failure as an event where a broker slows or stops request processing while continuing to lead a partition. A gray failure on a single broker often has a cascading impact on producers that use key-based partitioning, creating back pressure and upstream lag for messages bound for all brokers. While there have been adaptive partition switching improvements (e.g., KIP-794, Strictly Uniform Sticky Partitioner) that allow producers to route around problematic brokers, these do not work for key-based partitioning strategies, where the key determines the partition.

In this talk I describe partition multihoming (PMH), a form of virtual partitioning where two or more physical Kafka partitions are guaranteed to be consumed by the same consumer instance. When a broker is unhealthy, a multihoming partitioner can route messages through a partition that is led by a healthy broker and consumed by the same consumer instance as before the failure, preserving application functionality.
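As a rough illustration of the idea above, and not the implementation presented in the talk, the following Python sketch maps each key to a virtual partition backed by several physical partitions ("homes") and picks the first home whose leader is healthy. The partition layout, the function names, the CRC32 hash (a stand-in for Kafka's own key hashing), and the `leaders` and `unhealthy` inputs are all assumptions made for the example; how broker health is detected is outside its scope.

```python
import zlib
from typing import Dict, List, Set


def physical_partitions(virtual: int, num_virtual: int, homes: int) -> List[int]:
    """Physical partitions ("homes") that back one virtual partition.

    Assumed layout: home h of virtual partition v is physical
    partition v + h * num_virtual.
    """
    return [virtual + h * num_virtual for h in range(homes)]


def choose_partition(
    key: bytes,
    num_virtual: int,
    homes: int,
    leaders: Dict[int, int],   # physical partition -> leader broker id
    unhealthy: Set[int],       # broker ids currently considered unhealthy
) -> int:
    """Pick a physical partition for a key, skipping unhealthy leaders."""
    # CRC32 is a stand-in hash; the key always maps to the same virtual partition.
    virtual = zlib.crc32(key) % num_virtual
    candidates = physical_partitions(virtual, num_virtual, homes)
    for partition in candidates:
        if leaders[partition] not in unhealthy:
            return partition
    # Every home is led by an unhealthy broker: fall back to the primary home.
    return candidates[0]


def consumer_for(partition: int, num_virtual: int, num_consumers: int) -> int:
    """Hypothetical assignment that keeps all homes of a virtual partition
    on the same consumer instance."""
    virtual = partition % num_virtual
    return virtual % num_consumers
```

With four virtual partitions and two homes each (eight physical partitions), a key that normally lands on physical partition `v` is rerouted to `v + 4` when the leader of `v` is marked unhealthy, and `consumer_for` returns the same consumer index for both, which is the property the multihoming guarantee relies on.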

We will go over:

The real world impact of gray failures that led to the creation of PMH
The implementation details of PMH on key-based topics
The PMH workflow when a broker becomes unhealthy
The future roadmap for adding virtual partitioning in Apache Kafka for PMH

After this talk you will understand why PMH is key to reliability and should become a first-class feature of Kafka.

Presenter

Christopher Wildman

New Relic

Chris Wildman has been at New Relic for 8 years, working at the intersection of distributed systems and real-time, high-throughput stream processing of application telemetry, often with a focus on stateful stream processors. This has included work on the massive time series database (NRDB), telemetry aggregation for alerts processing, distributed tracing, entity tracking, dimensional metrics, OpenTelemetry Protocol support, and more. These projects rely on Kafka as the tool that helps New Relic reliably process data at an incredible scale. From this experience, Chris has learned many Kafka best practices and discovered where its limitations lie.
