Turbo-Charging Confluent Cloud To Be 10x Faster Than Apache Kafka®

At Current 2023, we announced that Confluent Cloud is now up to 10x faster than Apache Kafka®, thanks to Kora, the cloud-native Kafka engine that powers Confluent Cloud. This blog post covers what that means in more depth.

Running an “apples-to-apples” performance comparison between Apache Kafka and the fully managed Confluent Cloud can be tricky. For example, Confluent Cloud runs additional essential services (e.g., observability, durability auditing, billing, and logging; see the award-winning VLDB paper on Kora for more details) in the background on each node of our fully managed SaaS, keeping our 30K+ clusters durable, secure, and highly available. These functions consume resources and have no equivalent in the open source platform, so the comparison slightly penalizes Confluent Cloud. Kora is playing with a hand tied behind its back, but we saw this as just another challenge on the road to delivering better performance to our customers.

Building and improving Kora is a continuous process, but now we are happy to share that Confluent Cloud customers can:

  • Get their data to where it’s needed up to 10x faster with more predictable p99 latency across various workload profiles

  • Keep latency reliably low even under cloud provider infrastructure or service disruptions with Kora’s automatic monitoring and mitigation (no paging or any on-call involvement needed!)

  • Enjoy continuous performance optimizations without lifting a finger through seamless upgrades and automatic tuning

Figure 1: Confluent Cloud shows more than 10x tail latency improvements over Apache Kafka at 5.6 GBps aggregate throughput (1.4 GBps ingress + 4.2 GBps egress)

The remainder of this blog post is structured as follows. 

  1. First, we will walk through how we measure latency in Confluent Cloud and the need to improve latency for our customers both under steady state and in more real-world conditions (exposed to external factors such as workload changes, infrastructure degradation, and maintenance upgrades). 

  2. Second, we will share our learnings to improve steady-state latency and benchmarks comparing Confluent Cloud and Apache Kafka. 

  3. Finally, we will conclude with our learnings from improving latency fluctuations due to various external factors and show real customer examples of where this has shown benefit. 

Measuring latency in Confluent Cloud

The first step in reasoning about performance is deciding what to measure and how to do it. Benchmarks often measure performance in a lab environment with synthetic workloads, which may not fully reflect what customers experience in production. While truly observing customers’ experience requires client-side metrics (which we hope KIP-714 will address), we use health check services as a close approximation in Confluent Cloud. Health checks are part of Kora and rely on Apache Kafka producers and consumers to periodically produce and consume messages through the same path the user’s data would travel through. These services are used to measure end-to-end (E2E) latency—the aggregated time it takes for a producer to send a message and the consumer to read it—and to alert us when the observed latency breaches our internal SLOs. 
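The measurement described above can be sketched in a few lines. This is a minimal illustration, not Kora's health check implementation: the function names, the sample data, and the 100 ms SLO threshold are all hypothetical, and a nearest-rank percentile is assumed for the tail calculation.

```python
import math

def e2e_latencies_ms(records):
    """Turn (produce_ts_ms, consume_ts_ms) pairs into per-message E2E latencies."""
    return [consume_ts - produce_ts for produce_ts, consume_ts in records]

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def breaches_slo(values, pct, slo_ms):
    """True when the chosen percentile exceeds the SLO threshold."""
    return percentile(values, pct) > slo_ms

# Hypothetical probe results: 98 messages at 20 ms plus two slow outliers.
samples = [(t, t + 20) for t in range(0, 980, 10)] + [(980, 1230), (990, 1290)]
latencies = e2e_latencies_ms(samples)

print(percentile(latencies, 50))         # 20
print(percentile(latencies, 99))         # 250
print(breaches_slo(latencies, 99, 100))  # True (100 ms is an assumed SLO)
```

The example also shows why tail percentiles matter: the median stays at 20 ms while the p99 captures the slow messages a customer would actually notice.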

Figure 2: E2E latency measurement in Confluent Cloud health checks

After running a managed cloud SaaS offering for many years, and tracking customer requests and our internal SLOs week over week, it is clear to us that optimizing latency under steady state is a different problem from taming latency fluctuations caused by external factors (e.g., network jitter, noisy neighbors, managed cloud service degradations, zonal failures, software upgrades, and workload fluctuations). While most performance benchmarks focus on showcasing improvements to steady-state latencies, in the real world it’s latency fluctuations that often cause significant disruptions to customers’ actual performance experience. We invested in improving both to make sure Confluent Cloud’s latency is low, and stays consistent, in any scenario.

Now let’s get to what you really want to see—benchmarks against Apache Kafka, and our learnings from the results. 

Steady-state benchmark results: Confluent Cloud (Kora) vs. Apache Kafka

We benchmarked many of the workload patterns we see in Confluent Cloud. Let’s showcase a few examples of the results we saw:

  • With the same hardware setup, Confluent Cloud’s Kora engine outperforms Apache Kafka by up to 16x, as measured by p99 E2E latencies, across various workload profiles with aggregate throughput (ingress+egress) from 30 MBps to 5 GBps+

  • Confluent Cloud’s latencies stay low and more stable as throughput and partition counts scale, while Apache Kafka’s performance degrades significantly at heavier loads

  • Confluent Cloud’s performance is more predictable even at tail percentiles

Note
Keep in mind that while benchmarks are a guiding light, your mileage may vary depending on your workload. Our commitment is to continuously improve Kora and seamlessly deliver these improvements behind the scenes focusing on what our customers run. If you encounter scenarios where Confluent Cloud isn't meeting your performance standards, we invite you to inform us so that we can work towards innovating on your behalf. 

Benchmark setup

We ran our benchmarks in Confluent Cloud’s dedicated offering, which abstracts the underlying infrastructure in terms of CKUs (Confluent Unit for Kafka). CKUs provide consistent cross-cloud limits on various dimensions of Apache Kafka, so users don’t have to think about the underlying hardware; instead, they map their workload to CKUs and scale their CKU count up or down as necessary. For this benchmark, we tested Confluent Cloud’s ingress, egress, and partition limits against similarly resourced Apache Kafka clusters at two different scales, giving us workload variations across throughput and number of topic-partitions. We also tested different fanouts, keyed workloads, and different producer/consumer counts, all of which can affect the observed latency, but we won’t list every result here for brevity. In a majority of those use cases, Confluent Cloud outperforms Apache Kafka; in the others, we still have work cut out for us. The limits for the scales we tested are as follows.

| CKU | Ingress limit | Egress limit | Partitions limit |
|-----|---------------|--------------|------------------|
| 2   | 100 MBps      | 300 MBps     | 9000             |
| 28  | 1400 MBps     | 4200 MBps    | 100000           |

Figure 3: Confluent Cloud’s CKU limits
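To make the workload-to-CKU mapping concrete, here is a minimal sketch that checks a workload against the limits in Figure 3. It knows only the two scales listed there (2 and 28 CKUs); the class and function names are hypothetical and not part of any Confluent API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CkuLimits:
    ingress_mbps: int
    egress_mbps: int
    partitions: int

# Only the two scales from Figure 3; other CKU counts are deliberately omitted.
LIMITS = {
    2: CkuLimits(ingress_mbps=100, egress_mbps=300, partitions=9000),
    28: CkuLimits(ingress_mbps=1400, egress_mbps=4200, partitions=100000),
}

def smallest_fitting_cku(ingress_mbps, egress_mbps, partitions):
    """Return the smallest listed CKU count whose limits cover the workload, or None."""
    for cku in sorted(LIMITS):
        limits = LIMITS[cku]
        if (ingress_mbps <= limits.ingress_mbps
                and egress_mbps <= limits.egress_mbps
                and partitions <= limits.partitions):
            return cku
    return None

print(smallest_fitting_cku(50, 150, 4000))    # 2
print(smallest_fitting_cku(250, 750, 9000))   # 28 (ingress exceeds the 2 CKU limit)
print(smallest_fitting_cku(2000, 6000, 1000)) # None (beyond both listed scales)
```

A workload has to fit within every dimension at once, which is why the second example lands on 28 CKUs even though its partition count would fit in 2.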

To compare against Apache Kafka, we provisioned a similar hardware setup as that of Confluent Cloud (in AWS) with an equivalent number and type of broker compute, EBS storage, and client instances. The number of brokers per CKU and the size of the instance type also vary quite a bit within Confluent Cloud—we use various instance types to satisfy the limits and abstractions at various CKU counts. We also increased the disk size, IOPS, and throughput of EBS for Apache Kafka to be on par with Confluent Cloud for the tests. We also ensured that there were not any client bottlenecks by scaling up the clients along with the number of CKUs being tested. 

All tests were run on Apache Kafka 3.6 with KRaft, with the KRaft controllers running on three separate instances. To make the comparison fair at higher CKU counts, we raised the Apache Kafka server-side configs for the number of replica fetchers, network threads, and I/O threads to num.replica.fetchers=16, num.network.threads=16, and num.io.threads=16. With the default configs, Confluent Cloud’s measured advantage would have been much larger.
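For readers reproducing the Apache Kafka side, the sketch below renders those three overrides as broker properties lines and shows how far each one is from its stock value. The default values are taken from the Apache Kafka broker configuration documentation; the helper name `render_properties` is hypothetical.

```python
# Stock broker defaults, per the Apache Kafka broker configuration docs.
KAFKA_DEFAULTS = {
    "num.replica.fetchers": 1,
    "num.network.threads": 3,
    "num.io.threads": 8,
}

# Values used for the Apache Kafka clusters in this benchmark.
BENCHMARK_OVERRIDES = {
    "num.replica.fetchers": 16,
    "num.network.threads": 16,
    "num.io.threads": 16,
}

def render_properties(overrides):
    """Format overrides as key=value lines suitable for a broker properties file."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())

print(render_properties(BENCHMARK_OVERRIDES))
for key, tuned in BENCHMARK_OVERRIDES.items():
    print(f"# {key}: default {KAFKA_DEFAULTS[key]} -> benchmark {tuned}")
```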

Unless otherwise noted, our benchmarks used the following client configurations and other specifications:

| Configuration/Workload specification | Setting |
|--------------------------------------|---------|
| acks                                 | all |
| Fanout                               | 1:3 (1 producer per topic, 3 consumer groups per topic, 1 consumer per group) |
| Inter-broker SSL                     | enabled |
| Message size                         | 2 KB |
| linger.ms                            | 10 ms |
| batch.size                           | 16 KB (Java client default) |
| Warmup time                          | 30 minutes |
| Test duration                        | 1 hour |

Figure 4: Default configuration and workload specification for benchmarks in this blog post
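As a rough illustration of what these settings imply, the sketch below expresses the producer settings from Figure 4 as a plain dictionary and derives two back-of-the-envelope numbers. The helper names are hypothetical, the batch figure is an upper bound that ignores record and batch header overhead, and 1 MB is assumed to be 1,000,000 bytes.

```python
# Producer settings from Figure 4 (batch.size in bytes).
PRODUCER_CONFIG = {
    "acks": "all",
    "linger.ms": 10,
    "batch.size": 16 * 1024,
}
MESSAGE_SIZE_BYTES = 2 * 1024  # 2 KB messages

def max_messages_per_batch(batch_size_bytes, message_size_bytes):
    """Upper bound on messages per batch; real batches hold fewer due to overhead."""
    return batch_size_bytes // message_size_bytes

def messages_per_second(ingress_mbps, message_size_bytes, bytes_per_mb=1_000_000):
    """Approximate message rate for a given ingress; bytes_per_mb is an assumption."""
    return ingress_mbps * bytes_per_mb // message_size_bytes

print(max_messages_per_batch(PRODUCER_CONFIG["batch.size"], MESSAGE_SIZE_BYTES))  # 8
print(messages_per_second(100, MESSAGE_SIZE_BYTES))  # 48828
```

With at most eight 2 KB messages per 16 KB batch, batches on busy partitions are more likely to fill before the 10 ms linger expires, whereas lightly loaded partitions tend to wait out the linger instead.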

We invite you to try out these workloads and configs for any of our tests on Open Messaging Benchmark in the eu-central-2 AWS region in Confluent Cloud where this is available by default. 

Test 1: Consistent latency improvement as throughput scales

We started the benchmarking exercise by examining how consistent Confluent Cloud’s performance stays as throughput scales at fixed partition counts. Stable latency profiles across different load levels help ensure that latency is consistent and predictable whether users are supporting small workloads or handling massive traffic spikes like Black Friday events.

We experimented with two CKU counts, 2 and 28 CKUs, which correspond to different customer use cases we see in Confluent Cloud at different stages of their data streaming evolution. We varied the throughput up to the limit each CKU count supports while keeping the total number of partitions at each CKU count constant (partitions per topic were always 200). Ingress throughput ranged from 10 MBps to 1.4 GBps (40 MBps to 5.6 GBps aggregate, given the 1:3 fanout). Confluent Cloud significantly outperforms Apache Kafka as throughput scales at both CKU counts, with up to 12x improvements.
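The relationship between ingress, fanout, and aggregate throughput can be checked with a little arithmetic. The sketch below assumes the default 1:3 fanout from the benchmark configuration, so that every ingress byte is read by three consumer groups; the function names are hypothetical.

```python
def aggregate_mbps(ingress_mbps, consumer_groups=3):
    """Aggregate throughput = ingress + egress, where egress = ingress x consumer groups."""
    egress_mbps = ingress_mbps * consumer_groups
    return ingress_mbps + egress_mbps

def partitions_per_topic(total_partitions, topics):
    """Partitions per topic, assuming partitions are spread evenly across topics."""
    return total_partitions // topics

# Top of the 28 CKU range: 1.4 GBps ingress + 4.2 GBps egress.
print(aggregate_mbps(1400))                # 5600
# Partition layout at each scale (topic counts as listed in the results table).
print(partitions_per_topic(9000, 45))      # 200
print(partitions_per_topic(100000, 500))   # 200
```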

Figure 5: Confluent Cloud’s latency remains consistent as throughput scales and performance advantages expand to 12x
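The "Percentage improvement" column in the results table appears to express how much higher Apache Kafka's p99 latency is relative to Confluent Cloud's. That formula is inferred from the listed values rather than stated in the post, so treat this sketch as an interpretation; the function names are hypothetical.

```python
def improvement_pct(kafka_ms, kora_ms):
    """Percentage by which Kafka's latency exceeds Kora's, relative to Kora's."""
    return round((kafka_ms - kora_ms) / kora_ms * 100, 1)

def speedup(kafka_ms, kora_ms):
    """The same comparison expressed as an 'Nx faster' ratio."""
    return kafka_ms / kora_ms

# p99 E2E latencies (Apache Kafka, Confluent Cloud) from the 2 CKU rows.
print(improvement_pct(69, 42))      # 64.3
print(improvement_pct(110, 60))     # 83.3
print(improvement_pct(598, 242))    # 147.1
print(round(speedup(598, 242), 2))  # 2.47
```

Under this reading, a "+147.1%" entry corresponds to roughly a 2.5x speedup, so percentage improvements and "Nx" multipliers are two views of the same ratio.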

CKU

Throughput (ingress)

Apache Kafka p99 E2E latency

Confluent Cloud (Kora) p99 E2E latency

Percentage improvement

2 CKU 


(9000 partitions, 45 topics)

10 MBps

69 ms

42 ms

+64%

50 MBps

110 ms

60 ms

+83.3%

100 MBps

598 ms

242 ms

+147.1%


28 CKU 


(100000 partitions, 500 topics)

250 MBps

41 ms

25 ms

+64%

500 MBps

45 ms

26 ms