Optimizing Your Apache Kafka^® Deployment

작성자:

Yeva ByzekIntegration Architect

May 10, 2017읽기 시간: 6 min

Apache Kafka^® is the best enterprise streaming platform that runs straight off the shelf. Just point your client applications at your Kafka cluster and Kafka takes care of the rest: load is automatically distributed across the brokers, brokers automatically leverage zero-copy transfer to send data to consumers, consumer groups automatically rebalance when a consumer is added or removed, the state stores used by applications using the Kafka Streams API are automatically backed up to the cluster, partition leadership is automatically reassigned upon failure. It’s an operator’s dream come true!

Without needing to make any changes to Kafka configuration parameters, you can setup a development Kafka environment and test basic functionality. Yet the fact that Kafka runs straight off the shelf does not mean you won’t want to do some tuning before you go into production. The reason to tune is that different Apache Kafka use cases will have different sets of requirements that will drive different service goals.

To optimize for those service goals, there are Kafka configuration parameters that you should change. In fact, the Kafka design itself provides configuration flexibility to users, and to make sure your Kafka deployment is optimized for your service goals, you absolutely should investigate tuning the settings of some configuration parameters and benchmarking in your own environment. Ideally, you should do that before you go to production, or at least before you scale out to a larger cluster size.

We have written a white paper to help you identify those service goals, configure your Kafka deployment to optimize for them, and ensure that you are achieving them through monitoring. This paper has been refreshed to keep up with the latest features in Kafka including Kafka Streams and exactly once semantics (EOS).

Decide which service goals to optimize ➝ Configure Kafka cluster and clients ➝ Benchmark, monitor, and tune

The first step is to decide which service goals you want to optimize. We’ll consider four goals which often involve tradeoffs with one another: throughput, latency, durability, availability. To figure out which goals you want to optimize, recall the use cases your cluster is going to serve. Think about the applications, the business requirements—the things that absolutely cannot fail for that use case to be satisfied. Think about how Kafka as a streaming platform fits into the pipeline of your business.

Throughput | Latency | Durability | Availability

Sometimes the question of which service goal to optimize is hard to answer, but you have to force your team to discuss the original business use cases and what the main goals are. There are two reasons this discussion is important.

The first reason is that you can’t maximize all goals at the same time. There are occasionally tradeoffs between throughput, latency, durability, and availability, which we cover in detail in the whitepaper. You may be familiar with the common tradeoff in performance between throughput and latency, and perhaps between durability and availability as well. To that point, I had originally considered doing this whitepaper as two separate papers: one focused on throughput and latency goals and one for durability and availability goals. However, as I considered the whole system, I realized that you can’t really think about any of them in isolation and that they all belong in a single whitepaper. This does not mean that optimizing one of these goals results in completely losing out on the others. It just means that they are all interconnected, and you can’t maximize all of them at the same time.

The second reason it is important to identify which service goal you want to optimize is that you can and should tune Kafka configuration parameters to achieve it. You need to understand what your users expect from the system to ensure you are optimizing Kafka to meet their needs.

Do you want to optimize for high throughput, which is the rate that data is moved from producers to brokers or brokers to consumers? Some use cases have millions of writes per second. Because of Kafka’s design, writing large volumes of data into it is not a hard thing to do. It’s faster than trying to push volumes of data through a traditional database or key-value store, and it can be done with modest hardware.
Do you want to optimize for low latency, which is the elapsed time moving messages end-to-end (from producers to brokers to consumers)? One example of a low-latency use case is a chat application, where the recipient of a message needs to get the message with as little latency as possible. Other examples include interactive websites where users follow posts from friends in their network or real-time stream processing for Internet of Things.
Do you want to optimize for high durability, which guarantees that messages that have been committed will not be lost? One example use case for high durability is an event-driven microservices pipeline using Kafka as the event store. Another is for integration between a streaming source and some permanent storage (e.g. AWS S3) for mission critical business content.
Do you want to optimize for high availability, which minimizes downtime in case of unexpected failures? Kafka is a distributed system, and it is designed to tolerate failures. In use cases demanding high availability, it’s important to configure Kafka such that it will recover from failures as quickly as possible.

The white paper goes into technical detail on Kafka design and relevant configuration parameters you can tune to optimize for each of these four service goals. It will guide you through the critical parts of configuring producers, brokers, and consumers, and importantly highlights tradeoffs you should consider. There are hundreds of different configuration parameters, and you will be introduced to a subset that is relevant to this discussion. The configuration parameters discussed in the white paper include:

Producer:

batch.size
linger.ms
compression.type
acks
retries
replication.factor
enable.idempotence
max.in.flight.requests.per.connection
buffer.memory

Consumer:

fetch.min.bytes
enable.auto.commit
isolation.level
session.timeout.ms

Streams:

StreamsConfig.TOPOLOGY_OPTIMIZATION
StreamsConfig.REPLICATION_FACTOR_CONFIG
StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG
StreamsConfig.PROCESSING_GUARANTEE_CONFIG

Broker:

default.replication.factor
num.replica.fetchers
auto.create.topics.enable
min.insync.replicas
unclean.leader.election.enable
broker.rack
log.flush.interval.messages
log.flush.interval.ms
unclean.leader.election.enable
min.insync.replicas
num.recovery.threads.per.data.dir

In the paper, I provide a range of reasonable values for these configuration parameters depending on the service goal but recall that benchmarking is always crucial to validate the settings for your specific deployment. There is no “one size fits all” recommendation for the configuration parameters. Proper configuration always depends on the use case, hardware profile of each broker, what other features you have enabled, the data profile, etc. If you are tuning Kafka beyond the defaults, we generally recommend running benchmark tests.

Regardless of your service goals, you should understand what the performance profile of the cluster is—but it is especially important when you want to optimize for throughput or latency. Your benchmark tests can also feed into the calculations for determining the correct number of partitions, cluster size, and the number of producer and consumer processes.

Using the paper as a guide, you can do the benchmarking and optimizations before or even after you go into production. There are many Kafka internal metrics for servers and clients that you can monitor in the enterprise-class Confluent Control Center, and the paper provides guidance on which ones are most important for robust monitoring to ensure you are achieving your service goals. The metrics discussed will help:

Gauge the load on the brokers
Reveal where processing time is being spent
Point to areas of potential bottlenecks
Show consumer lag which is important for real-time applications
Indicate general cluster health

You may also leverage the expertise of professional services at Confluent, the company founded by the original developers of Apache Kafka.

Download “Optimizing Your Apache Kafka Deployment” here: https://www.confluent.io/white-paper/optimizing-your-apache-kafka-deployment/

Yeva is an integration architect at Confluent designing solutions and building demos for developers and operators of Apache Kafka. She has many years of experience validating and optimizing end-to-end solutions for distributed software systems and networks.