Kafka broker gray failures are a common source of incidents at New Relic. We define a gray failure as an event where a broker slows or stops request processing while continuing to lead a partition. A gray failure on a single broker often has a cascading impact on producers that use key-based partitioning, creating back pressure and upstream lag for messages bound for all brokers. While there have been adaptive partition switching improvements (e.g., KIP-794 Strictly Uniform Sticky Partitioner) that allow producers to route around problematic brokers, these do not work for key-based partitioning strategies.
In this talk I describe partition multihoming (PMH), a form of virtual partitioning where two or more physical Kafka partitions are guaranteed to be consumed by the same consumer instance. When a broker is unhealthy, a multihoming partitioner can route messages through partitions that are led by a healthy broker and destined for the same consumer as before the failure, preserving application functionality.
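As a rough illustration of the idea, the sketch below models a multihoming partitioner in plain Python. It is not the New Relic implementation and uses no Kafka client API. The layout (virtual partition `v` owns physical partitions `v`, `v + num_virtual`, ...), the CRC32 key hash, the function names, and the externally supplied set of unhealthy broker ids are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Set


def home_partitions(virtual: int, num_virtual: int, homes: int) -> List[int]:
    """Physical partitions backing one virtual partition (assumed layout)."""
    return [virtual + i * num_virtual for i in range(homes)]


def choose_partition(
    key: bytes,
    num_virtual: int,
    homes: int,
    leaders: Dict[int, int],   # physical partition -> leader broker id
    unhealthy: Set[int],       # broker ids currently considered unhealthy
) -> int:
    """Pick a physical partition for a key, skipping unhealthy leaders."""
    # The key always maps to the same virtual partition, so the set of
    # candidate physical partitions does not change when a broker degrades.
    # CRC32 is a stand-in hash, not Kafka's default key hash.
    virtual = zlib.crc32(key) % num_virtual
    candidates = home_partitions(virtual, num_virtual, homes)
    for partition in candidates:
        if leaders[partition] not in unhealthy:
            return partition
    # Every home is led by an unhealthy broker: fall back to the primary home.
    return candidates[0]


def assign_consumer(partition: int, num_virtual: int, num_consumers: int) -> int:
    """Assign all homes of a virtual partition to the same consumer index."""
    return (partition % num_virtual) % num_consumers


# Hypothetical topic: 3 virtual partitions, 2 homes each, 6 physical partitions.
leaders = {0: 1, 1: 2, 2: 3, 3: 2, 4: 3, 5: 1}
before = choose_partition(b"account-42", 3, 2, leaders, unhealthy=set())
after = choose_partition(b"account-42", 3, 2, leaders, unhealthy={leaders[before]})
print(before, after, assign_consumer(before, 3, 2), assign_consumer(after, 3, 2))
```

In this sketch the key moves to a different physical partition once its primary leader is marked unhealthy, yet both partitions resolve to the same consumer index, which is the property PMH relies on. How broker health is detected is outside the scope of the sketch.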
We will go over:
The real-world impact of gray failures that led to the creation of PMH
The implementation details of PMH on key-based topics
The PMH workflow when a broker becomes unhealthy
The future roadmap for adding virtual partitioning in Apache Kafka for PMH
After this talk you will understand why PMH is key to reliability and should become a first-class feature of Kafka.