Some “Kafkaesque” Days in Operations at LinkedIn in 2015

Kafka Summit 2016 | Operations Track

Kafka is a backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. 2015 was an exciting year at LinkedIn in that we hit a new level of scale with Kafka: we now process more than 1 trillion published messages per day across nearly 1,300 brokers. We run into some interesting production issues at this scale, and I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year:

Data loss: We have extremely stringent SLAs on latency and completeness that were violated on a few occasions. Some of these incidents were due to subtle configuration problems or even missing features.

Offset resets: As of early 2015, Kafka-based offset management was still a relatively new feature and we occasionally hit offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management/log compaction as well as our monitoring.

Cluster unavailability due to high request/response latencies: Such incidents demonstrate how even subtle performance regressions and monitoring gaps can lead to an eventual cluster meltdown.

Power failures! What happens when an entire data center goes down? We experienced this firsthand and it was not so pretty.

and more…

This talk will go over how we detected, investigated, and remediated each of these issues, and will summarize some of the features in Kafka that we are working on to help eliminate or mitigate such incidents in the future.

Speaker:

Joel Koshy, Staff Software Engineer, LinkedIn
