Availability of Kafka

« Current 2022

How do we define and measure availability in a distributed system? A great thing about distributed systems is that they are built to tolerate failures in a way that limits downtime for users. However, this means that availability is a bit more complicated than "the system is up" or "the system is down."

Even if the system is built to tolerate failures, we may see individual components lose availability due to:

* cloud provider outages
* high latencies
* load balancer and/or routing issues
* storage failures
* hardware issues

Using Apache Kafka and Confluent Cloud as a case study, we will dig deeper into how to define good SLOs and SLAs for distributed systems. From there we will discuss ways to improve availability and the changes we made to Confluent Cloud to improve on Kafka's availability story.

Speaker

Justine Olshan

Confluent

Justine graduated from Carnegie Mellon University in 2020 with a degree in computer science. During summer 2019, she was a software engineering intern at Confluent where she worked on improving the Apache Kafka® producer. After graduating, she returned full time to Confluent and continues to work on improving Kafka through various KIPs including KIP-516 which introduced topic IDs to Kafka and KIP-890 which strengthened the transactional protocol. She became an Apache Kafka committer in 2022 and PMC member in 2023.
