Apache Kafka is well known as a low-latency, high-throughput and highly configurable streaming platform. At AWS, we run thousands of Kafka clusters, each with its own hardware and software configuration. Managing such a large and diverse Kafka fleet has taught us several operational lessons, and we would like to share some of them with you.
We’ll talk about several topics, including (a) monitoring Kafka health, (b) optimizing Kafka to address compute, storage and networking bottlenecks, (c) automating the detection and mitigation of compute, storage and networking infrastructure failures, and (d) continuous software patching.