How do we define and measure availability in a distributed system? One great thing about distributed systems is that they are built to tolerate failures in a way that limits downtime for users. However, this means that availability is more complicated than "the system is up" or "the system is down."
Even if the system is built to tolerate failures, we may see individual components lose availability due to:

* cloud provider outages
* high latencies
* load balancer and/or routing issues
* storage failures
* hardware issues

Using Apache Kafka and Confluent Cloud as a case study, we will dig deeper into how to define good SLOs and SLAs for distributed systems. From there, we will discuss ways to improve availability and the changes we made to Confluent Cloud to improve on Kafka's availability story.