Providing exceptional customer experiences at scale requires data pipelines that move data reliably across systems and applications in real time and that scale seamlessly. For the past several years, we relied on message-queue-based data pipelines to transfer data between applications. However, as the number of use cases requiring real-time data transfer grew rapidly, the messaging platform became difficult to scale. Moving to Kafka resolved these pipeline scaling issues and reduced publisher/subscriber onboarding time from several weeks to a few days. To support on-demand scaling of our Kafka clusters, we run them on Red Hat OpenShift, an enterprise Kubernetes platform. While managing Kafka clusters that handle critical financial events, we have learned several lessons and developed efficient strategies for running production-grade Kafka on OpenShift. In this talk, we will present:
1. Some of the challenges that we faced with Kafka on OpenShift and how we evolved our infrastructure to overcome them.
2. Our experiences from operating Kafka clusters at scale in production.
3. Our strategy for performing automated Kafka deployment and rollback in OpenShift.
4. Our failover strategy, which uses Confluent’s Replicator to ensure service availability during cluster failures.