Apache Kafka® is the best enterprise streaming platform that runs straight off the shelf. Just point your client applications at your Kafka cluster and Kafka takes care of the rest: load is automatically distributed across the brokers, brokers automatically leverage zero-copy transfer to send data to consumers, consumer groups automatically rebalance when a consumer is added or removed, the state stores used by applications using the Kafka Streams API are automatically backed up to the cluster, and partition leadership is automatically reassigned upon failure. It’s an operator’s dream come true!
Without needing to make any changes to Kafka configuration parameters, you can set up a development Kafka environment and test basic functionality. Yet the fact that Kafka runs straight off the shelf does not mean you won’t want to do some tuning before you go into production. The reason to tune is that different Apache Kafka use cases have different requirements, which in turn drive different service goals.
To optimize for those service goals, there are Kafka configuration parameters that you should change. In fact, the Kafka design itself provides configuration flexibility to users, and to make sure your Kafka deployment is optimized for your service goals, you absolutely should investigate tuning the settings of some configuration parameters and benchmarking in your own environment. Ideally, you should do that before you go to production, or at least before you scale out to a larger cluster size.
We have written a white paper to help you identify those service goals, configure your Kafka deployment to optimize for them, and ensure that you are achieving them through monitoring. This paper has been refreshed to keep up with the latest features in Kafka, including Kafka Streams and exactly-once semantics (EOS).
The first step is to decide which service goals you want to optimize. We’ll consider four goals which often involve tradeoffs with one another: throughput, latency, durability, availability. To figure out which goals you want to optimize, recall the use cases your cluster is going to serve. Think about the applications, the business requirements—the things that absolutely cannot fail for that use case to be satisfied. Think about how Kafka as a streaming platform fits into the pipeline of your business.
Sometimes the question of which service goal to optimize is hard to answer, but you have to force your team to discuss the original business use cases and what the main goals are. There are two reasons this discussion is important.
The first reason is that you can’t maximize all goals at the same time. There are occasionally tradeoffs between throughput, latency, durability, and availability, which we cover in detail in the white paper. You may be familiar with the common tradeoff in performance between throughput and latency, and perhaps between durability and availability as well. To that point, I had originally considered writing two separate papers: one focused on throughput and latency goals and one on durability and availability goals. However, as I considered the whole system, I realized that you can’t really think about any of them in isolation and that they all belong in a single white paper. This does not mean that optimizing one of these goals results in completely losing out on the others. It just means that they are all interconnected, and you can’t maximize all of them at the same time.
The second reason it is important to identify which service goal you want to optimize is that you can and should tune Kafka configuration parameters to achieve it. You need to understand what your users expect from the system to ensure you are optimizing Kafka to meet their needs.
The white paper goes into technical detail on Kafka design and relevant configuration parameters you can tune to optimize for each of these four service goals. It will guide you through the critical parts of configuring producers, brokers, and consumers, and importantly highlights tradeoffs you should consider. There are hundreds of different configuration parameters, and you will be introduced to a subset that is relevant to this discussion. The configuration parameters discussed in the white paper include:
In the paper, I provide a range of reasonable values for these configuration parameters depending on the service goal but recall that benchmarking is always crucial to validate the settings for your specific deployment. There is no “one size fits all” recommendation for the configuration parameters. Proper configuration always depends on the use case, hardware profile of each broker, what other features you have enabled, the data profile, etc. If you are tuning Kafka beyond the defaults, we generally recommend running benchmark tests.
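To make the idea of goal-specific settings concrete, here is a minimal sketch of how producer configurations might differ by service goal. The keys (`batch.size`, `linger.ms`, `compression.type`, `acks`, `enable.idempotence`) are real Kafka producer configuration names, but the values, the `PRODUCER_PROFILES` table, and the `producer_config` helper are assumptions made up for illustration, not the recommendations from the white paper. Availability is left out because it is tuned mostly on the broker and consumer side.

```python
# Illustrative producer settings per service goal.
# The keys are real Kafka producer configuration names; the values are
# assumptions for this sketch and must be validated by benchmarking.
PRODUCER_PROFILES = {
    "throughput": {
        "batch.size": 200_000,       # larger batches, fewer requests
        "linger.ms": 50,             # wait briefly so batches can fill
        "compression.type": "lz4",   # fewer bytes on the wire
        "acks": "1",                 # leader-only acknowledgment
    },
    "latency": {
        "linger.ms": 0,              # send as soon as data is available
        "compression.type": "none",  # skip compression CPU time
        "acks": "1",
    },
    "durability": {
        "acks": "all",               # wait for all in-sync replicas
        "enable.idempotence": True,  # avoid duplicates on retry
    },
}


def producer_config(goal, bootstrap_servers="localhost:9092"):
    """Return a producer config dict for one goal (hypothetical helper)."""
    if goal not in PRODUCER_PROFILES:
        raise ValueError(f"unknown service goal: {goal!r}")
    config = {"bootstrap.servers": bootstrap_servers}
    config.update(PRODUCER_PROFILES[goal])
    return config


if __name__ == "__main__":
    for goal in PRODUCER_PROFILES:
        print(goal, producer_config(goal))
```

Notice how the same parameter pulls in opposite directions: a non-zero `linger.ms` helps throughput but adds latency, and `acks=all` improves durability at the cost of both. That is the tradeoff in miniature, and it is why the values above are only a starting point for your own benchmarks.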
Regardless of your service goals, you should understand what the performance profile of the cluster is—but it is especially important when you want to optimize for throughput or latency. Your benchmark tests can also feed into the calculations for determining the correct number of partitions, cluster size, and the number of producer and consumer processes.
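As one example of feeding benchmark numbers into a sizing calculation, a commonly used rule of thumb estimates the partition count as the larger of target throughput divided by measured per-partition producer throughput and target throughput divided by measured per-partition consumer throughput. The sketch below assumes that rule; the function name and the sample numbers are hypothetical, and the result is a starting point rather than a guarantee.

```python
import math


def estimate_partitions(target_mb_s, producer_mb_s_per_partition,
                        consumer_mb_s_per_partition):
    """Estimate a partition count from benchmarked throughput numbers.

    Rule of thumb (an assumption of this sketch): take the larger of
    target / per-partition producer throughput and
    target / per-partition consumer throughput, rounded up.
    """
    inputs = (target_mb_s, producer_mb_s_per_partition,
              consumer_mb_s_per_partition)
    if min(inputs) <= 0:
        raise ValueError("all throughput values must be positive")
    needed_for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    needed_for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(needed_for_producers, needed_for_consumers)


if __name__ == "__main__":
    # Hypothetical benchmark results: 100 MB/s target, 10 MB/s per
    # partition when producing, 20 MB/s per partition when consuming.
    print(estimate_partitions(100, 10, 20))  # prints 10
```

In this made-up case the producer side is the bottleneck, so it sets the partition count. Rerun the estimate whenever your benchmark results or target throughput change.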
Using the paper as a guide, you can do the benchmarking and optimizations before or even after you go into production. There are many Kafka internal metrics for servers and clients that you can monitor in the enterprise-class Confluent Control Center, and the paper provides guidance on which ones are most important for robust monitoring to ensure you are achieving your service goals. The metrics discussed will help:
You may also leverage the expertise of professional services at Confluent, the company founded by the original developers of Apache Kafka.
Download “Optimizing Your Apache Kafka Deployment” here: https://www.confluent.io/white-paper/optimizing-your-apache-kafka-deployment/