
Presentation

Reducing Impact of Single Broker Failures in Kafka

Kafka Summit London 2023

At New Relic we've had a number of unexpected problems where a single broker caused disproportionately severe issues for Kafka processing. We go in depth into the different scenarios that allow this to happen, the configuration we had chosen hoping for the best but which made these outages possible or worse, and what we did to reduce the impact while still keeping Kafka configured as desired.
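The abstract doesn't list the exact settings involved, but on the producer side the failure window during a single-broker problem is typically governed by a handful of timeout and durability parameters. A minimal sketch of those knobs (broker addresses and values are illustrative, not the configuration described in the talk):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTimeoutSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // With acks=all, a send only succeeds once all in-sync replicas have the
        // record, so one slow or stuck broker can hold up many partitions at once.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Upper bound on how long a record may be retried before the send fails;
        // a large value favours riding out blips over failing fast.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        // How long a single request to a broker may take before it is retried.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);

        // How long send() itself may block waiting for metadata or buffer space;
        // when the buffer fills up behind a bad broker, this is where producer
        // threads visibly stall.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) calls go here.
        }
    }
}
```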

The outages range from shallow broker health checks combined with slow storage and certain producer configurations leading to a 20+ minute full service outage caused by a single broker, to a case where simply trying to consume data from a broker in the same availability zone blocked processing after a broker in the consumers' AZ rebooted. We also cover how we solved routing around bad brokers when producers use a partition key (which makes it a harder problem).
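The same-AZ consumption scenario likely maps onto Kafka's fetch-from-closest-replica feature (KIP-392), where consumers advertise their rack and the broker's replica selector steers them to an in-zone replica. The talk doesn't spell out the configuration; as a rough sketch, the consumer side looks like this (topic, group, and rack names are hypothetical):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RackAwareConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // When brokers set replica.selector.class to
        // org.apache.kafka.common.replica.RackAwareReplicaSelector, a consumer that
        // advertises its rack/AZ here is served by a replica in the same zone.
        // If that in-zone broker reboots, fetches for its partitions can stall
        // until replica selection catches up, which is the failure mode the
        // abstract describes.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            consumer.poll(Duration.ofMillis(100));
        }
    }
}
```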
