What happens when your biggest, most important Kafka cluster suddenly goes rogue, and while trying to recover it, a single crucial misconfiguration makes things even worse?
At a company like Taboola, where service availability and latency are our top priorities, this was a disaster.
With 300K messages/sec and 250TB of messages produced each day to our on-premise Kafka clusters, and mirrored to our central Kafka cluster, we work hard to ensure Kafka behaves well under high traffic loads and unexpected cluster failures. So when our main Kafka cluster went crazy, we had a serious issue on our hands.
This session is the story of how we learned the hard way to mitigate cluster failures with the proper configurations in place.