
Presentation

Reducing Impact of Single Broker Failures in Kafka

Kafka Summit London 2023

At New Relic we've had a number of unexpected problems where a single broker caused disproportionately severe issues for Kafka processing. We go in depth into the different scenarios that allow this to happen, the configuration we had chosen hoping for the best but which made these outages possible or worse, and what we did to reduce the impact while still keeping Kafka configured as desired.
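The abstract doesn't list the exact settings involved, but on the producer side the failure window during a single-broker problem is typically governed by a handful of timeout and durability parameters. A minimal sketch of those knobs (broker addresses and values are illustrative, not the configuration described in the talk):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerTimeoutSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // With acks=all, a send only succeeds once all in-sync replicas have the
        // record, so one slow or stuck broker can hold up many partitions at once.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Upper bound on how long a record may be retried before the send fails;
        // a large value favours riding out blips over failing fast.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        // How long a single request to a broker may take before it is retried.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);

        // How long send() itself may block waiting for metadata or buffer space;
        // when the buffer fills up behind a bad broker, this is where producer
        // threads visibly stall.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) calls go here.
        }
    }
}
```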

The outages range from shallow broker health checks combined with slow storage and certain producer configurations leading to a 20+ minute full service outage caused by a single broker, to a case where simply trying to consume data from a broker in the same availability zone blocked processing after a broker in the consumers' AZ rebooted. We also cover how we solved routing around bad brokers when producers use a partition key (which makes it a harder problem).
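The same-AZ consumption scenario likely maps onto Kafka's fetch-from-closest-replica feature (KIP-392), where consumers advertise their rack and the broker's replica selector steers them to an in-zone replica. The talk doesn't spell out the configuration; as a rough sketch, the consumer side looks like this (topic, group, and rack names are hypothetical):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RackAwareConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // When brokers set replica.selector.class to
        // org.apache.kafka.common.replica.RackAwareReplicaSelector, a consumer that
        // advertises its rack/AZ here is served by a replica in the same zone.
        // If that in-zone broker reboots, fetches for its partitions can stall
        // until replica selection catches up, which is the failure mode the
        // abstract describes.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic"));
            consumer.poll(Duration.ofMillis(100));
        }
    }
}
```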
