Every Kafka admin’s worst nightmare is being woken up by client application teams reporting increased latencies. One of the main culprits? Degraded infrastructure.
Degraded infrastructure refers to the partial or full unavailability of broker components, such as storage volumes or the network. Storage degradation, in particular, can slow reads and writes on the broker, hurt performance, and quickly devolve into unavailability.
In this talk, we will discuss how we have tackled this problem head-on with a fully automated degraded storage detection and remediation system. We’ll highlight the importance of monitoring storage performance and take a deep dive into how we formulated the detection algorithm, created and fine-tuned our monitors, and tested this pipeline end to end. We will also discuss the tools and processes we developed to mitigate storage degradation. Finally, we’ll share our insights on how this streamlined detection and mitigation system improved the performance and availability of clusters in Confluent Cloud.