
Running Kafka as a Service at Scale

Kafka Summit SF 2017 | Systems Track

Apache Kafka is recognized as the world’s leading real-time, fault-tolerant, highly scalable streaming platform. It has been adopted by thousands of companies worldwide, from web giants like LinkedIn, Netflix, and Uber to large enterprises like Apple, Cisco, Goldman Sachs, and more.

In this talk, we will look at what Confluent has done, with help from the community, to enable running Kafka as a fully managed service. Engineers at Confluent have spent multiple years running Kafka as a service and learned valuable lessons in the process. In particular, they learned how different it is to run Kafka in a controlled environment inside a single company versus running it for thousands of companies. This talk will go over those lessons and what we have built in Kafka as a result, which is available to all Kafka users as part of Confluent Cloud.

We will cover three key aspects:

  • Resiliency – A data system needs to be highly available and should never lose data. Kafka is no different. We are very paranoid about the guarantees that Kafka provides, and we have put a lot of effort into ensuring Kafka is extremely durable and highly available. To achieve this, we have taken various steps, including improving the replication protocol, rewriting the controller, reducing ZooKeeper failures, and more. We will go over each of these improvements.
  • Observability – To run any data system as a service, we need to be able to measure and alert on key metrics. Drawing on our previous experience, we have spent a lot of time adding metrics to Kafka so that it can be easily monitored. These include better client, storage, replication, and controller metrics, so that any external alerting system can monitor and alert on them. We will go over some of them in this talk and describe why they are important.
  • Extensive testing – Confluent runs thousands of tests nightly and has now run them for hundreds of hours. Based on our testing, we have been able to proactively identify issues and work with the community to fix them. We will discuss the different tests we write, the types of fault injection we do, and the issues we have identified and fixed in this process.

We will also touch on code quality and the compile-time bug identification approaches we employ to build the highly reliable system that is essential to running Kafka as a service with top-notch SLAs.

Sriram Subramanian
Director, Platform & Infra Engineering, Confluent
