Confluent
Optimizing Your Apache Kafka™ Deployment
Apache Kafka

Optimizing Your Apache Kafka™ Deployment

Yeva Byzek

Apache KafkaTM is the best enterprise streaming platform that runs straight off the shelf. Just point your client applications at your Kafka cluster and Kafka takes care of the rest: load is automatically distributed across the brokers, brokers automatically leverage zero-copy transfer to send data to consumers, consumer groups automatically rebalance when a consumer is added or removed, the state stores used by applications using the Kafka Streams API are automatically backed up to the cluster, partition leadership is automatically reassigned upon failure. It’s an operator’s dream come true!

Without needing to make any changes to Kafka configuration parameters, you can setup a development Kafka environment and test basic functionality. Yet the fact that Kafka runs straight off the shelf does not mean you won’t want to do some tuning before you go into production. The reason to tune is that different Apache Kafka use cases will have different sets of requirements that will drive different service goals. To optimize for those service goals, there are Kafka configuration parameters that you should change. In fact, the Kafka design itself provides configuration flexibility to users, and to make sure your Kafka deployment is optimized for your service goals, you absolutely should investigate tuning the settings of some configuration parameters and benchmarking in your own environment. Ideally, you should do that before you go to production, or at least before you scale out to a larger cluster size.

We have written a white paper to help you identify those service goals, configure your Kafka deployment to optimize for them, and ensure that you are achieving them through monitoring.

The first step is to decide which service goals you want to optimize. We’ll consider four goals which often involve tradeoffs with one another: throughput, latency, durability, availability. To figure out which goals you want to optimize, recall the use cases your cluster is going to serve. Think about the applications, the business requirements—the things that absolutely cannot fail for that use case to be satisfied. Think about how Kafka as a streaming platform fits into the pipeline of your business.

Sometimes the question of which service goal to optimize is hard to answer, but you have to force your team to discuss the original business use cases and what the main goals are. There are two reasons this discussion is important.

The first reason is that you can’t maximize all goals at the same time. There are occasionally tradeoffs between throughput, latency, durability, and availability, which we cover in detail in the whitepaper. You may be familiar with the common tradeoff in performance between throughput and latency, and perhaps between durability and availability as well. To that point, I had originally considered doing this whitepaper as two separate papers: one focused on throughput and latency goals and one for durability and availability goals. However, as I considered the whole system, I realized that you can’t really think about any of them in isolation and that they all belong in a single whitepaper. This does not mean that optimizing one of these goals results in completely losing out on the others. It just means that they are all interconnected, and you can’t maximize all of them at the same time.

The second reason it is important to identify which service goal you want to optimize is that you can and should tune Kafka configuration parameters to achieve it. You need to understand what your users expect from the system to ensure you are optimizing Kafka to meet their needs.

  • Do you want to optimize for high throughput, which is the rate that data is moved from producers to brokers or brokers to consumers? Some use cases have millions of writes per second. Because of Kafka’s design, writing large volumes of data into it is not a hard thing to do. It’s faster than trying to push volumes of data through a traditional database or key-value store, and it can be done with modest hardware.
  • Do you want to optimize for low latency, which is the elapsed time moving messages end-to-end (from producers to brokers to consumers)? One example of a low-latency use case is a chat application, where the recipient of a message needs to get the message with as little latency as possible. Other examples include interactive websites where users follow posts from friends in their network or real-time stream processing for Internet of Things.
  • Do you want to optimize for high durability, which guarantees that messages that have been committed will not be lost? One example use case for high durability is an event-driven microservices pipeline using Kafka as the event store. Another is for integration between a streaming source and some permanent storage (e.g. AWS S3) for mission critical business content.
  • Do you want to optimize for high availability, which minimizes downtime in case of unexpected failures? Kafka is a distributed system, and it is designed to tolerate failures. In use cases demanding high availability, it’s important to configure Kafka such that it will recover from failures as quickly as possible.

The white paper goes into technical detail on Kafka design and relevant configuration parameters you can tune to optimize for each of these four service goals. It will guide you through the critical parts of configuring producers, brokers, and consumers, and importantly highlights tradeoffs you should consider. There are hundreds of different configuration parameters, and you will be introduced to a subset that is relevant to this discussion. The configuration parameters discussed in the white paper include:

Producer:

  • batch.size
  • linger.ms
  • compression.type
  • acks
  • retries
  • max.in.flight.requests.per.connection
  • buffer.memory

Broker:

  • default.replication.factor
  • num.replica.fetchers
  • auto.create.topics.enable
  • min.insync.replicas
  • unclean.leader.election.enable
  • broker.rack
  • log.flush.interval.messages
  • log.flush.interval.ms
  • unclean.leader.election.enable
  • min.insync.replicas
  • num.recovery.threads.per.data.dir

Consumer:

  • fetch.min.bytes
  • auto.commit.enable
  • session.timeout.ms

In the paper, I provide a range of reasonable values for these configuration parameters depending on the service goal but recall that benchmarking is always crucial to validate the settings for your specific deployment. There is no “one size fits all” recommendation for the configuration parameters. Proper configuration always depends on the use case, hardware profile of each broker, what other features you have enabled, the data profile, etc. If you are tuning Kafka beyond the defaults, we generally recommend running benchmark tests. Regardless of your service goals, you should understand what the performance profile of the cluster is — but it is especially important when you want to optimize for throughput or latency. Your benchmark tests can also feed into the calculations for determining the correct number of partitions, cluster size, and the number of producer and consumer processes.

Using the paper as a guide, you can do the benchmarking and optimizations before or even after you go into production. There are many Kafka internal metrics for servers and clients that you can monitor in the enterprise-class Confluent Control Center, and the paper provides guidance on which ones are most important for robust monitoring to ensure you are achieving your service goals. The metrics discussed will help:

  • Gauge the load on the brokers
  • Reveal where processing time is being spent
  • Point to areas of potential bottlenecks
  • Show consumer lag which is important for real-time applications
  • Indicate general cluster health

You may also leverage the expertise in the professional services team at Confluent, the company which was founded by the original developers of Apache Kafka: Learn more at https://www.confluent.io/services/

Download “Optimizing Your Apache Kafka Deployment” here: https://www.confluent.io/white-paper/optimizing-your-apache-kafka-deployment/

Subscribe to the Confluent Blog

Subscribe
Email *

More Articles Like This

Chrix Finne

Confluent 3.2 with Apache Kafka™ 0.10.2 Now Available

Chrix Finne . .

We’re excited to announce the release of Confluent 3.2, our enterprise streaming platform built on Apache Kafka. At Confluent, our vision is to provide a comprehensive, enterprise-ready streaming platform that ...

Gwen Shapira

The First Annual State of Apache Kafka™ Client Use Survey

Gwen Shapira . .

At the end of 2016 we conducted a survey of the Apache Kafka™ community regarding their use of Kafka clients (the producers and consumers used with Kafka) and their priorities ...

Leave a Reply

Your email address will not be published. Required fields are marked *

Comments

  1. Hi Yeva,

    I am Manu, a Senior Software Engineer at Prysm. Wonderful article on Apache Kafka deployment in production, We are at Prysm planning to use Kafka for distributed messaging and computing. I had few questions regarding the way client application connect to Kafka broker in Production.

    We have a WPF Client and we want to ship log data from Client to Server by connecting to Kafka broker(Cloud) over TCP/IP, but the problem is we have 200 customers which ends up in 200 WPF Client and 200 connections to Kafka Broker. Lets say over period of time if the customer grows and we have 1000 clients, it will end up with 1000 connections to Kafka broker.

    Is it ok to use so many connections? Is it a best practice to expose Kafka Broker over TCP/IP?
    I am looking forward for your comments and suggestions.

    Best regards,
    Manu

    1. In general, you can scale up the number of TCP connections as long as the request rate is not too high, which you can monitor on the broker with RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent and other metrics mentioned in the white paper. We have seen production deployments with tens of thousands of TCP connections to individual brokers, so 1000 should be fine. I also suggest monitoring latency because network latency tends to be higher in the cloud. For further discussion on this topic, please feel free to reach out to Confluent’s Community Slack or Google group (linked from https://www.confluent.io/contact/)

Try Confluent Platform

Download Now