Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now
At Current 2023, we announced that Confluent Cloud is now up to 10x faster than Apache Kafka®, thanks to Kora, The Cloud-Native Kafka engine that powers Confluent Cloud. In this blog post, we will cover what that means in more depth.
Running an “apples-to-apples” performance comparison between Apache Kafka and the fully managed Confluent Cloud can be tricky. For example, Confluent Cloud has additional essential services (e.g., observability, durability auditing, billing, logging—see the award-winning VLDB paper on Kora for more details) running in the background on each node of our fully managed SaaS, keeping our 30K+ clusters durable, secure, and highly available. These functions consume resources that have no equivalent in the open source platform, so the comparison slightly penalizes Confluent Cloud. Kora is playing with a hand tied behind its back, but we saw this as just another challenge on the road to deliver better performance to our customers.
Building and improving Kora is a continuous process, but now we are happy to share that Confluent Cloud customers can:
Get their data to where it’s needed up to 10x faster with more predictable p99 latency across various workload profiles
Keep latency reliably low even under cloud provider infrastructure or service disruptions with Kora’s automatic monitoring and mitigation (no paging or any on-call involvement needed!)
Enjoy continuous performance optimizations without lifting a finger through seamless upgrades and automatic tuning
The remainder of this blog post is structured as follows.
First, we will walk through how we measure latency in Confluent Cloud and the need to improve latency for our customers both under steady state and in more real-world conditions (exposed to external factors such as workload changes, infrastructure degradation, and maintenance upgrades).
Second, we will share our learnings to improve steady-state latency and benchmarks comparing Confluent Cloud and Apache Kafka.
Finally, we will conclude with our learnings from improving latency fluctuations due to various external factors and show real customer examples of where this has shown benefit.
The first step in reasoning about performance is deciding what to measure and how to do it. Benchmarks often measure performance in a lab environment with synthetic workloads, which may not fully reflect what customers experience in production. While truly observing customers’ experience requires client-side metrics (which we hope KIP-714 will address), we use health check services as a close approximation in Confluent Cloud. Health checks are part of Kora and rely on Apache Kafka producers and consumers to periodically produce and consume messages through the same path the user’s data would travel through. These services are used to measure end-to-end (E2E) latency—the aggregated time it takes for a producer to send a message and the consumer to read it—and to alert us when the observed latency breaches our internal SLOs.
Having run a managed cloud SaaS offering for many years observing customer requests and our internal SLOs week over week, it is abundantly clear to us that there is a distinction between optimizing for latency under steady state vs latency fluctuations caused by external factors (e.g., due to network jitters, noisy neighbors, managed cloud service degradations, zonal failures, software upgrades, workload fluctuations, etc.). While most performance benchmarks focus on showcasing improvements to steady-state latencies, in the real world it’s latency fluctuations that often cause significant disruptions to customers’ true performance experience. We invested in improving both to make sure Confluent Cloud’s latency is low, and stays consistent, in any scenario.
Now let’s get to what you really want to see—benchmarks against Apache Kafka, and our learnings from the results.
We benchmarked many of the workload patterns we see in Confluent Cloud. Let’s showcase a few examples of the results we saw:
With the same hardware setup, Confluent Cloud’s Kora engine outperforms Apache Kafka by up to 16x, as measured by p99 E2E latencies, across various workload profiles with aggregate throughput (ingress+egress) from 30 MBps to 5 GBps+
Confluent Cloud’s latencies stay low and more stable as throughput and partition scales, while Apache Kafka’s performance degrades significantly at heavier loads
Confluent Cloud’s performance is more predictable even at tail percentiles
We ran our benchmarks in Confluent Cloud’s dedicated offering which abstracts the underlying infrastructure in terms of CKUs (Confluent Unit for Kafka). CKUs provide consistent cross-cloud limits on various dimensions of Apache Kafka so that users don’t have to think of underlying hardware but instead think in terms of their workload to CKU mapping and scale up/down their CKUs as necessary. For the purpose of this benchmark, we tested the dimensions of ingress/egress and partition limits of Confluent Cloud against similarly resourced Apache Kafka clusters at 2 different scales to have workload variations across throughput and number of topic-partitions. Though we also tested different fanouts, keyed workloads, and different producer/consumer counts which can impact the latency experienced, we won’t list all of them here for brevity. In a majority of those use cases, Confluent Cloud outperforms Apache Kafka, whereas in other cases there is more work cut out for us. The limits for the different scales we tested are as follows.
To compare against Apache Kafka, we provisioned a similar hardware setup as that of Confluent Cloud (in AWS) with an equivalent number and type of broker compute, EBS storage, and client instances. The number of brokers per CKU and the size of the instance type also vary quite a bit within Confluent Cloud—we use various instance types to satisfy the limits and abstractions at various CKU counts. We also increased the disk size, IOPS, and throughput of EBS for Apache Kafka to be on par with Confluent Cloud for the tests. We also ensured that there were not any client bottlenecks by scaling up the clients along with the number of CKUs being tested.
All tests were run on Apache Kafka 3.6 with KRaft. The KRaft controllers were running in three separate instances. To compare against higher CKU counts in Confluent Cloud, the Apache Kafka server-side configs for the number of replica fetchers, network threads, and I/O threads were increased to num.replica.fetchers=16, num.network.threads=16, num.io.threads=16
. Using default configs would have made the Confluent Cloud improvements much higher.
Unless otherwise noted, our benchmarks used the following client configurations and other specifications:
We invite you to try out these workloads and configs for any of our tests on Open Messaging Benchmark in the eu-central-2
AWS region in Confluent Cloud where this is available by default.
We started the benchmarking exercise by examining Confluent Cloud’s performance consistency as throughput scales at fixed partition counts. By achieving stable latency profiles across different levels of loads, we help ensure latency is consistent and predictable whether users are supporting small workloads or handling massive traffic spikes like Black Friday events.
We experimented across 2 different CKU counts, 2 and 28 CKUs, which correspond to different customer use cases we see in Confluent Cloud in their data streaming evolution. We varied the throughput up to the limit each CKU supports, keeping the total number of partitions at each CKU constant (partitions per topic were always constant at 200). The throughput varied from 10 MBps to 1.4 GBps ingress (which is 30 MBps to 5.6 GBps aggregate). We see that Confluent Cloud significantly outperforms Apache Kafka as throughput scales across the different CKUs with up to 12x improvements.
We also altered the partition count per topic dimension in 2 and 28 CKU while keeping the throughput consistent to see whether partition count changes will affect our system’s performance. Confluent Cloud outperforms Apache Kafka with up to 11.5x performance improvements.
*Note that at low partition count per CKU and high throughput, background tiering activity interferes with foreground produce/consume processing. Apache Kafka was run without KIP-405 (Kafka Tiered Storage) since that isn’t default yet, and hence this is not an apples-to-apples comparison. Though we haven’t seen this scenario in production (where all partitions tier at the same time), we are working on further improving QoS to minimize the impact of background operations on foreground processing as we evolve our architecture.
We have been using p99 E2E latency across our benchmarks because it is the most common requirement from our latency-sensitive customers. We believe that performance benchmarks are only reliable if they use metrics that reflect what users actually measure in their daily work.
Since p95 or average latency may also be relevant for less latency-sensitive users, we prepared the results at different percentiles below using the 28 CKU (1.4 GBps/500 topics) scenario. As shown in Figure 8, Confluent Cloud’s E2E latency stays low even at tail latencies (16x more superior), showing that its performance is more stable and predictable than Apache Kafka.
Kora is a fresh take on what Apache Kafka can be in the cloud. Since we published our VLDB paper, our architecture has been evolving continuously. Kora has had some fundamental changes all across the stack for improved performance, including but not limited to increased batching and parallelism, out-of-order processing, and a new replication protocol. These improvements have happened in tandem with a continuously evolving security, durability, and reliability posture to deliver Kora at a competitive price for our customers. While steady-state improvements are great, the even harder part is delivering consistent performance in the real world. Let's talk more about that in the next section.
When planning to choose an Apache Kafka offering in the cloud, most users may just do a steady-state benchmark comparison like the ones above, and go with the vendor that can satisfy their requirements. However, most performance benchmarks only showcase results for “when things go well” while overlooking day two operations like network jitters, service degradations, or zonal failures which can contribute to a significant number of disruptions to latency.
Our philosophy towards performance was not just to look good on paper, but rather to stay persistent and predictable in the real world through any type of disruptions. We’ll cover some of the continuous optimizations we‘ve implemented over the years.
Self-Balancing Clusters have been a staple of Confluent Cloud’s elasticity. As part of the efforts examining and optimizing Kora’s brokers for latency, we saw an opportunity to further improve the speed at which our self-balancing algorithm reacts to customer workload fluctuations. For example, a customer who uses a single Confluent Cloud cluster across different internal organizations (i.e., multi-tenanted) could see their aggregated performance altered with any sudden change in any internal organization’s behavior. The original algorithm was very heavy-weight in that it tried to balance the cluster for multiple variables like partition distribution and capacity constraints (ingress, egress) in a single iteration. This was inefficient because the balancing consumed a lot of CPU, it led to large rebalances which took longer (by the time the rebalance finished the workload could have further changed), and some of the variables didn’t correlate close enough to warrant being done together (i.e, ingress/egress distribution could be different than partition distribution). Our enhanced self-balancing feature was completely re-architected to be more real-time and lightweight to ensure it reacts swiftly to workload changes.
Cloud infrastructure and services experience degradation all the time. To minimize the impact on latencies during these degradations, we have a robust pipeline that automatically and reliably:
Detects when a broker is experiencing degraded hardware.
Takes the broker out of the critical path by draining the broker of all partition leadership and in-sync replica (ISR) membership; we call this broker demotion. This prevents the degraded broker from impacting latencies as none of its partition replicas form part of the ISR (but still get replicated to asynchronously). Note that we do have safeguards in place to ensure that there is no availability/durability loss due to the loss of an AZ in this mode.
Once the underlying hardware issue is resolved, the replicas on the broker that lost leadership are restored to normal operational mode and are reelected if doing so respects the self-balancing cluster optimal leader placement.
It should be noted that this mechanism protects against latency spikes; issues that affect redundancy are treated differently by moving offline replicas to another broker to maintain the replication factor.
Example of handling latency spikes with infrastructure degradation monitoring in the fleet: Consider a multi-tenant Kora instance where a storage degradation caused a corresponding spike in the health-check monitoring around 10:46 a.m. in Kafka-4, the fifth broker of an 8-broker cluster. Note that Kafka-3 and Kafka-2’s latency are also affected due to them being in the ISR. This is a concern in any leader-based replication protocol as any infrastructure degradations can cause performance issues but not be enough to trigger failure detection. As a result, a more statistical notion of failure detection is needed to detect and automatically remediate these failures. Around 10:50 a.m., the process explained above detects and mitigates the problem through broker demotion and brings the latency down automatically—no paging, no alarms, no on-call involvement, completely automatic. Kafka-4 has no data points after 10:50 a.m., since it no longer hosts any partition leadership, until 11:05 a.m. when it is restored (the observability platform graphs this as a downward straight line between those two data points). This same mechanism has triggered thousands of times across our fleet in the last month alone.
Kafka health check E2E latencies (p99, ms)
As a managed service we continuously update our customers’ clusters for various reasons including but not limited to security patches, new features, and better hardware. With Apache Kafka, the maintenance and upgrade process can be quite disruptive as it entails leadership movements (i.e., moving the leaders out of a broker to patch it) on the server side which requires a series of steps including processing metadata updates, truncating the log (if necessary), restarting replica fetchers, and so on. Clients also have to rediscover the new leader. This process repeats when the upgraded broker comes back up and the leadership migrates back to the upgraded broker (which could cause disruptions again).
We have worked on improving this problem in Kora over the last year to decrease the mean time to synchronize partition leadership changes between leaders, followers, and clients by improving the metadata propagation protocol, parallelizing leadership change application, and carefully orchestrating leadership movement to warm up new brokers. We show below a benchmark of doing an upgrade (roll) in Apache Kafka vs Kora. We are continuously improving on making this better across various scenarios as this is a death-by-a-thousand-cuts kind of problem to ensure there is no need for maintenance windows in the first place and all patching is handled behind the scenes with no or minimal impact on the workload.
Simulation setup: We simulated a roll in Apache Kafka by restarting brokers in sequence as a normal patching process would proceed when a workload was running using the Open Messaging Benchmark workload. Confluent Cloud was running 2 CKUs while Apache Kafka was running 3.6 (KRaft mode) with the same configuration as explained in the steady state benchmark section. The workload here has 9000 partitions across 10 topics.
As you can see, the latency spikes up during the roll of every broker in Apache Kafka (sometimes up to multiple seconds) whereas Confluent Cloud is relatively seamless with only occasional spikes. Note that since client behavior also significantly influences the latencies observed during a roll, it is always a good idea to upgrade clients to the latest and use ones that are supported widely.
Hopefully, this gave you an interesting look into our ongoing performance journey in Confluent Cloud. All of this hard work wouldn’t be possible without the devoted support from all of our Kora engineering team, especially:
Prince Mahajan, Niket Goel, David Mao, Fred Zheng, Dimitar Dimitrov, Ashwin Sangem, Badai Aqrandista, Crispin Bernier, Chris Flood, Harish Sharma, Elena Anyusheva, Ning Shan, Shimiao Zhang, Srinivas Akhil Mallela, Yuying Li, Jason Gustafson, Josh Hanson, Yun Fu, Xavier Leaute, Adithya Chandra, Rittika Adhikari, Michael Borokhovich, Nimisha Mehta, Feng Min, Gopi Attaluri, Jose Garcia Sancio, Mayank Narula, Scott Hendricks, Vikas Singh, Lingnan Liu, Ismael Juma, Dhruvil Shah, Nikhil Bhatia, Anil Sharma, Anna Povzner, Tiffany Jianto, Yi Ding, Arun Singhal, Jon Chiu, Sameer Tejani, Harish Vishwanath, Giuseppe Spada, David Stockton, Praveen Balasubramanian.
Engineering a data system to lower latency is not just about running profilers, optimizing code and algorithms, and running performance benchmarks to measure the improvements. These things are important but only half the story; the other half is ensuring that the software, in a production setting, also performs well with predictable latencies. Production is where hardware fails, or worse, degrades; where clusters need to get rolled for security patches and upgrades, and where we run a suite of monitoring and security software that can compete for resources. This other half doesn't appear in benchmarks and it is where operational excellence comes into play.
For us, there’s nothing better than a customer waking up to latencies being improved overnight:
“Speed and performance are one of the reasons that we chose Confluent in the first place. One time, we saw a huge improvement, 8x in fact, in latency and cluster load overnight. We actually thought something was wrong with our Confluent Cloud cluster and opened a support ticket. Turned out it was the rollout of a Kora engine performance update. We were able to get so much more out of the capacity that we were able to shrink our cluster and save some money with no impact to our applications.” - BMW Group
We are continuously evolving our architecture and innovating on our customers’ behalf to provide a better price and performance with the Kora engine. Stay tuned and we can’t wait to share all the exciting things we’re working on in 2024.
Want to take the now turbo-charged Confluent Cloud for a spin? Sign up for free!
Take a tour of the internals of Confluent’s Apache Kafka® service, powered by Kora: the next-generation, cloud-native streaming engine.
People often imagine that to provide a cloud service for a piece of open source software is a simple matter of packaging up the open source and putting it in […]