Are you running at scale? Have you experienced “voodoo problems” in your infrastructure? We run a 5M messages/sec cluster that has taught us some valuable lessons. We have watched our Kafka clusters become sluggish or crash, taking our production services down with them, and we hope the insights we gained will help you handle your next production incident and keep your data pipelines running smoothly. We’ll tell the story of skews and anomalies in CPU and disk metrics, drawing graphs and conclusions along the way. You’ll learn how compacted topics, partition distribution, and RAM can affect your cluster’s performance. Finally, we’ll look at how a small configuration drift can rattle your cluster. Our goal is to give you the tools and knowledge to navigate this uncharted territory.