Kafka is easy to set up as a messaging service and serves the purpose well. However, it gets complicated in a multi-tenant environment, where users have different SLA on availability, durability and latency. As traffic grows, managing a huge and monolithic Kafka cluster in a cloud environment has been proved to be problematic and hard to scale.
At Netflix, our Kafka messaging system evolves into a multi-cluster and hierarchical service where it can serve over a trillion messages per day. Topics are allocated in either shared or dedicated clusters according to SLA requirements and can be migrated across clusters. Infrastructure routers connect Kafka clusters and provide hierarchical access to data. With the help of enhanced client libraries and proxies, clients interact with the service using higher level APIs and abstracted access points. Kafka deployments are transparent from clients. Enabled by our client libraries and Netflix cloud infrastructure, we are able to mitigate Kafka cluster level failures with our Kafka failover which is also transparent to the clients.
In this talk, we are going to discuss why this architecture is necessary and how we have implemented it with essential components including management and self-service tools, infrastructure routers, client libraries, proxies and monitoring service.
Senior Software Engineer, Netflix