
Presentation

Mitigating Kafka Broker ‘Gray’ Failures For Key Based Partitioners With Partition Multihoming

Current 2023

Kafka broker gray failures are a common source of incidents at New Relic. We define a gray failure as an event in which a broker slows or stops request processing while continuing to lead a partition. A single broker gray failure often has a cascading impact on producers that use key-based partitioning, creating back pressure and upstream lag for messages to all brokers. Adaptive partition switching improvements (e.g. KIP-794, Strictly Uniform Sticky Partitioner) allow producers to route around problematic brokers, but they do not work for key-based partitioning strategies, because the key determines the target partition.

In this talk I describe partition multihoming (PMH), a form of virtual partitioning where two or more physical Kafka partitions are guaranteed to be consumed by the same consumer instance. When a broker is unhealthy, a multihoming partitioner can route messages through partitions that are led by a healthy broker and destined for the same consumer as before the failure, preserving application functionality.
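As a rough illustration of the idea, the sketch below models PMH in plain Python without any Kafka client. Everything in it is an assumption made for illustration and not the implementation presented in the talk: the function names, the layout in which virtual partition v owns physical partitions v, v + V, v + 2V and so on, the CRC32 hash (Kafka's default partitioner uses murmur2), and the round-robin consumer assignment are all hypothetical.

```python
import zlib
from typing import Dict, List, Set


def virtual_partition(key: bytes, num_virtual: int) -> int:
    """Map a key to a virtual partition (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key) % num_virtual


def homes(virtual: int, num_virtual: int, homes_per_virtual: int) -> List[int]:
    """Physical partitions backing one virtual partition (hypothetical layout)."""
    return [virtual + i * num_virtual for i in range(homes_per_virtual)]


def choose_partition(key: bytes, num_virtual: int, homes_per_virtual: int,
                     leaders: Dict[int, int], unhealthy: Set[int]) -> int:
    """Pick the first home whose leader broker is not marked unhealthy."""
    candidates = homes(virtual_partition(key, num_virtual),
                       num_virtual, homes_per_virtual)
    for partition in candidates:
        if leaders[partition] not in unhealthy:
            return partition
    # Every home is led by an unhealthy broker: fall back to the primary home.
    return candidates[0]


def assign(num_virtual: int, homes_per_virtual: int,
           consumers: List[str]) -> Dict[str, List[int]]:
    """Give all homes of a virtual partition to the same consumer instance."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for v in range(num_virtual):
        owner = consumers[v % len(consumers)]
        assignment[owner].extend(homes(v, num_virtual, homes_per_virtual))
    return assignment


# Hypothetical cluster: 6 physical partitions, 3 virtual partitions, 2 homes each.
# The two homes of each virtual partition are led by different brokers.
LEADERS = {0: 1, 1: 2, 2: 3, 3: 2, 4: 3, 5: 1}  # partition -> leader broker id
ASSIGNMENT = assign(3, 2, ["consumer-a", "consumer-b", "consumer-c"])

key = b"order-42"
before = choose_partition(key, 3, 2, LEADERS, unhealthy=set())
after = choose_partition(key, 3, 2, LEADERS, unhealthy={LEADERS[before]})
owner = next(c for c, parts in ASSIGNMENT.items() if before in parts)
print(before, after, after in ASSIGNMENT[owner])
```

In this sketch the key moves to a different physical partition once the leader of its primary home is marked unhealthy, yet the message still reaches the same consumer instance. How broker health is detected, and how ordering is handled across the switch, are left out here and would need separate treatment in a real system.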

We will go over:

  • The real-world impact of gray failures that led to the creation of PMH

  • The implementation details of PMH on key-based topics

  • The PMH workflow when a broker becomes unhealthy

  • The future roadmap for adding virtual partitioning in Apache Kafka for PMH

After this talk you will understand why PMH is key to reliability and should become a first-class feature of Kafka.
