Availability is a key metric for any Kafka deployment, but when every event is critical, the system must be designed to keep publishers and consumers highly available even when a Kafka cluster goes down. At Stripe, our core business relies on Kafka, and as we outgrew a single Kafka cluster we had to build a multi-cluster system that fit our needs while supporting a target of 99.9999% availability for our most critical use cases.
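To put the 99.9999% figure in perspective, the short calculation below converts an availability target into a yearly downtime budget. It is only an illustrative sketch: the function name `downtime_budget_seconds` and the 365-day year are assumptions made for this example, not part of Stripe's tooling. Under those assumptions, six nines leaves roughly 31.5 seconds of unavailability per year.

```python
# Illustrative arithmetic only; names and the 365-day year are assumptions.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def downtime_budget_seconds(availability_percent: float,
                            period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the maximum number of unavailable seconds allowed in the period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return period_seconds * unavailable_fraction


if __name__ == "__main__":
    # Compare a few common targets against the six-nines target from the abstract.
    for target in (99.9, 99.99, 99.9999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% availability allows about {budget:.1f} s of downtime per year")
```

A budget this small is shorter than many single-cluster failover or restart procedures, which is one way to see why a single cluster was not enough for this target.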
In this talk we’ll discuss our solution to this problem: an in-house proxy layer and multi-cluster topology that we’ve built and operated over the past three years. Our proxy layer enables multiple Kafka clusters to work in coordination across the globe while hitting our ambitious availability targets and providing clean client abstractions.
We’ll cover how our Kafka deployment provides availability for both publishers and consumers in the face of cluster outages, increased security and observability, simplified cluster maintenance, and global routing for constraints such as data locality. We’ll highlight the benefits and tradeoffs of our approach, the design of our proxy layer, our Kafka configuration decisions, and where we’re planning to go from here.