At Current 2023, we announced that Confluent Cloud is now up to 10x faster than Apache Kafka®, thanks to Kora, The Cloud-Native Kafka engine that powers Confluent Cloud. In this blog post, we will cover what that means in more depth.
Running an “apples-to-apples” performance comparison between Apache Kafka and the fully managed Confluent Cloud can be tricky. For example, Confluent Cloud has additional essential services (e.g., observability, durability auditing, billing, logging—see the award-winning VLDB paper on Kora for more details) running in the background on each node of our fully managed SaaS, keeping our 30K+ clusters durable, secure, and highly available. These functions consume resources that have no equivalent in the open source platform, so the comparison slightly penalizes Confluent Cloud. Kora is playing with a hand tied behind its back, but we saw this as just another challenge on the road to deliver better performance to our customers.
Building and improving Kora is a continuous process, but now we are happy to share that Confluent Cloud customers can:
Get their data to where it’s needed up to 10x faster with more predictable p99 latency across various workload profiles
Keep latency reliably low even under cloud provider infrastructure or service disruptions with Kora’s automatic monitoring and mitigation (no paging or any on-call involvement needed!)
Enjoy continuous performance optimizations without lifting a finger through seamless upgrades and automatic tuning
The remainder of this blog post is structured as follows.
First, we will walk through how we measure latency in Confluent Cloud and the need to improve latency for our customers both under steady state and in more real-world conditions (exposed to external factors such as workload changes, infrastructure degradation, and maintenance upgrades).
Second, we will share our learnings to improve steady-state latency and benchmarks comparing Confluent Cloud and Apache Kafka.
Finally, we will conclude with our learnings from improving latency fluctuations due to various external factors and show real customer examples of where this has shown benefit.
The first step in reasoning about performance is deciding what to measure and how to do it. Benchmarks often measure performance in a lab environment with synthetic workloads, which may not fully reflect what customers experience in production. While truly observing customers’ experience requires client-side metrics (which we hope KIP-714 will address), we use health check services as a close approximation in Confluent Cloud. Health checks are part of Kora and rely on Apache Kafka producers and consumers to periodically produce and consume messages through the same path the user’s data would travel through. These services are used to measure end-to-end (E2E) latency—the aggregated time it takes for a producer to send a message and the consumer to read it—and to alert us when the observed latency breaches our internal SLOs.
Having run a managed cloud SaaS offering for many years observing customer requests and our internal SLOs week over week, it is abundantly clear to us that there is a distinction between optimizing for latency under steady state vs latency fluctuations caused by external factors (e.g., due to network jitters, noisy neighbors, managed cloud service degradations, zonal failures, software upgrades, workload fluctuations, etc.). While most performance benchmarks focus on showcasing improvements to steady-state latencies, in the real world it’s latency fluctuations that often cause significant disruptions to customers’ true performance experience. We invested in improving both to make sure Confluent Cloud’s latency is low, and stays consistent, in any scenario.
Now let’s get to what you really want to see—benchmarks against Apache Kafka, and our learnings from the results.
We benchmarked many of the workload patterns we see in Confluent Cloud. Let’s showcase a few examples of the results we saw:
With the same hardware setup, Confluent Cloud’s Kora engine outperforms Apache Kafka by up to 16x, as measured by p99 E2E latencies, across various workload profiles with aggregate throughput (ingress+egress) from 30 MBps to 5 GBps+
Confluent Cloud’s latencies stay low and more stable as throughput and partition scales, while Apache Kafka’s performance degrades significantly at heavier loads
Confluent Cloud’s performance is more predictable even at tail percentiles
We ran our benchmarks in Confluent Cloud’s dedicated offering which abstracts the underlying infrastructure in terms of CKUs (Confluent Unit for Kafka). CKUs provide consistent cross-cloud limits on various dimensions of Apache Kafka so that users don’t have to think of underlying hardware but instead think in terms of their workload to CKU mapping and scale up/down their CKUs as necessary. For the purpose of this benchmark, we tested the dimensions of ingress/egress and partition limits of Confluent Cloud against similarly resourced Apache Kafka clusters at 2 different scales to have workload variations across throughput and number of topic-partitions. Though we also tested different fanouts, keyed workloads, and different producer/consumer counts which can impact the latency experienced, we won’t list all of them here for brevity. In a majority of those use cases, Confluent Cloud outperforms Apache Kafka, whereas in other cases there is more work cut out for us. The limits for the different scales we tested are as follows.
To compare against Apache Kafka, we provisioned a similar hardware setup as that of Confluent Cloud (in AWS) with an equivalent number and type of broker compute, EBS storage, and client instances. The number of brokers per CKU and the size of the instance type also vary quite a bit within Confluent Cloud—we use various instance types to satisfy the limits and abstractions at various CKU counts. We also increased the disk size, IOPS, and throughput of EBS for Apache Kafka to be on par with Confluent Cloud for the tests. We also ensured that there were not any client bottlenecks by scaling up the clients along with the number of CKUs being tested.
All tests were run on Apache Kafka 3.6 with KRaft. The KRaft controllers were running in three separate instances. To compare against higher CKU counts in Confluent Cloud, the Apache Kafka server-side configs for the number of replica fetchers, network threads, and I/O threads were increased to num.replica.fetchers=16, num.network.threads=16, num.io.threads=16
. Using default configs would have made the Confluent Cloud improvements much higher.
Unless otherwise noted, our benchmarks used the following client configurations and other specifications:
We invite you to try out these workloads and configs for any of our tests on Open Messaging Benchmark in the eu-central-2
AWS region in Confluent Cloud where this is available by default.
We started the benchmarking exercise by examining Confluent Cloud’s performance consistency as throughput scales at fixed partition counts. By achieving stable latency profiles across different levels of loads, we help ensure latency is consistent and predictable whether users are supporting small workloads or handling massive traffic spikes like Black Friday events.
We experimented across 2 different CKU counts, 2 and 28 CKUs, which correspond to different customer use cases we see in Confluent Cloud in their data streaming evolution. We varied the throughput up to the limit each CKU supports, keeping the total number of partitions at each CKU constant (partitions per topic were always constant at 200). The throughput varied from 10 MBps to 1.4 GBps ingress (which is 30 MBps to 5.6 GBps aggregate). We see that Confluent Cloud significantly outperforms Apache Kafka as throughput scales across the different CKUs with up to 12x improvements.