Apache Kafka® is the de facto standard for event streaming today. The semantics of the partitioned consumer model that Kafka pioneered have enabled scale at a level and at a cost that were previously unobtainable. Today, the cost to stream data is so low that we can now afford to widen the funnel of data we use as input to our analytics and operations. And that data volume continues to grow. Increasingly, the challenge is to scale traffic levels elastically without downtime.
Expanding without downtime is hard. In distributed systems, there are no magic bullets or cures, there is only hard work. Doing that work reliably and safely is what makes systems dependable for even the most critical workloads. While Kafka has become a mature, stable platform for event streaming, there is still a lot happening behind the scenes. When you need to elastically scale, it gets even more complicated. This blog post will look at the inner workings of Confluent Cloud when you expand your workload.
As part of our elastic theme for Project Metamorphosis, which is announcing a series of improvements to Confluent Cloud, we’re excited to share with you a demo that shows a Confluent Cloud Dedicated cluster scaling to over 10 GBps of aggregate throughput in seven clicks. Specifically, this is 2.7 GBps of ingress and 8.1 GBps of egress. At this rate, the stream is carrying a terabyte of data every 96 seconds. While the numbers may be impressive, what’s more impressive is the ability to make this change without any downtime.
At its core, Kafka is an event streaming platform built around the concept of a distributed log: the topic. The topic itself is a logical construct that is a collection of partitions. Each partition stores a subset of the topic. Those partitions live on brokers. Generally, each partition has replicas stored on other brokers to improve availability. In Confluent Cloud, there are always at least three replicas (more on that later). At all times, one of these partition replicas is the leader, and there is only one leader per partition—which means that at any given moment, the leader exists on a specific broker (i.e., machine). If a broker needs to reboot or if the traffic changes, then the partitions need to be moved between brokers to keep even traffic across the cluster. For more about partitioning and placement, see Auto Data Balancing.
The breakthrough concept of Kafka was to scale out horizontally via partitions rather than scale up vertically. Horizontally means adding more machines; vertically means using bigger machines. This means a properly crafted workload will pretty much scale out linearly in Kafka. That’s how our demo got to 10 GBps of aggregate throughput so easily: 2.5 GBps ingress and 7.5 GBps egress. If you were trying to do that through one machine, you would have a very difficult time because at some point you’ll run out of bigger machine options. This is why supercomputers shifted to a scale-out model over the last two decades.
Confluent Cloud is software as a service (SaaS) in the sense that there truly is no infrastructure for you to manage. This is different from “managed services” or “managed platforms” that are infrastructure automations. As we walk through what happens behind the scenes in this demo, that will become more apparent. To be clear, there is still infrastructure, just as serverless still uses servers, but Confluent is automating and orchestrating all that infrastructure for you. We do this by running our cloud on Kubernetes, using our very own Confluent Operator, which is commercially available.
When you create a Dedicated (single-tenant) cluster in Confluent Cloud, we are creating a Kafka cluster just for you and all the automation that goes into that is generally what you would expect from a cloud provider. We’re leveraging the infrastructure capabilities of the cloud provider you have chosen to run Confluent Cloud in to accomplish this: Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
We build these clusters around the concept of a CKU, a Confluent Unit for Kafka, which is a unit of capacity that you can compose into clusters of any size. A CKU is a minimum size of cloud resources necessary to run a Kafka cluster. CKUs encapsulate Virtual Machines (VMs), disks, storage, and other resources into a single, stackable package. CKUs include guidelines and limits for resource utilization like the maximum number of partitions, connections, and other resources that will impact the performance of a Kafka cluster. Clusters can be 1 to N CKUs. At first, it might not seem that different from a hosted or managed service, but as you use the Kafka cluster, this is where things begin to change.
As you create topics in Confluent Cloud rather than explicitly placing the partitions on brokers or relying on a random placement strategy, Confluent Cloud places the partitions intelligently. Initially, this is easy: The cluster is empty. As you add more topics and partitions, this becomes more complicated, and as your workload changes over time, it becomes even more so. This is where the SaaS nature of Confluent Cloud really starts to become apparent: All of this placement is done for you automatically and in a way that minimizes both the time necessary to scale and the impact of scaling operations.
As your workload grows over time, you fill the capacity of the brokers in your Kafka cluster. You must then add brokers to the cluster to handle additional traffic. There are really several parts to this process, and we’ll cover each one separately.
Outside of Confluent Cloud, this can be a tricky question to answer. You need to understand how well you’ve been load balancing your cluster, what the periodicity of your traffic is, and any sort of skew in your partitioning schemes. You then basically work out how much traffic each broker is capable of sustaining, with headroom to survive failures, and then divide your expected traffic increase by the per-broker throughput you’re achieving. All this depends on the hardware and configuration of the brokers you’re adding as those choices will determine where bottlenecks develop in your stream. This is even more complex because as you scale some components like compute or network, other aspects like Input/output operations per second (IOPS) can become bottlenecks.
Although every workload is different, you can generally think of a CKU as facilitating 50 MBps of a well-balanced and well-written Kafka workload. The metrics in Confluent Cloud will help you see how your workload is performing. These metrics are available both in the UI and via an API. If your workload of 30 MBps is running well on one CKU and you want to double it, simply add a second CKU. If your workload of 400 MBps is running steadily on 10 CKUs and you want to add another 20% capacity to your cluster, add two CKUs. See how this works in the demo, and now let’s dive a little deeper into exactly what is happening.
First, the CKUs you add are also adding their resources, brokers, disks, storage, etc., to the existing cluster. This fast provisioning of cloud resources is what the promise of the cloud is all about. On Confluent Cloud, we leverage the underlying services of the cloud provider you have chosen to acquire these resources. Confluent Operator, the very same enterprise-ready implementation of the Kubernetes Operator API available in Confluent Platform, makes our service very cloud portable.
Beyond the brokers, there are significant networking and configuration steps that take place, but none of those are exposed to customers. The combination of cloud and Kubernetes make this part considerably easier than it would have been either in your own datacenter or just with VMs alone, but it is only part of the story. After this is where the real hard work in Kafka begins.
As mentioned earlier, data resides in partitions and partitions live on brokers. Once the new infrastructure is ready for traffic, some of the partitions need to be moved from the old brokers to the new brokers. Figuring out which partitions to move and where is really a careful balance. Your end state is a well-distributed workload, but getting there is complex. And you generally don’t want to turn off the cluster to do this. This is sort of like adding cylinders to your car’s engine as you’re driving up a mountain.
The entire operation revolves around a rebalancing plan. You need to identify which brokers are under the most stress and move some of their workload off of them. You accomplish this by incrementing the in-sync replicas (ISR) count for the topic and placing the new partition replicas on the new brokers. This is important because if you just try to force a failover, you will block writes to the topic if the ISR falls below the minimum ISR level specified. Give them some time and they will synchronize the data. After they have synchronized with the new replicas, you can begin the failover process and make one of the replicas on a different broker the leader.
Depending on the conditions of the workload this may or may not be the new broker, but the result will be a step towards a more evenly distributed load. Now you move to failover the next replica. Then you repeat this process for every partition in every topic, stopping along the way to see how the broker load balancing is working out, since you don’t want to overload the new brokers. Much of what we’ve learned in this space has been learned with our friends at LinkedIn on CruiseControl and over our years of building Confluent Platform.
The basic algorithm for this process is as follows:
For each broker Identify candidate partitions to move based on broker utilization Identify topics they belong to For each topic identified Increment ISR For each partition Place new replica on new broker Change leader for partition* Decrement ISR
The actual algorithm is a bit more complex and includes many optimizations including batching and capabilities we added into Kafka with KIP-455. Along the way, you need to stop and check the resource utilization on the brokers to make sure that you’re evenly distributing the workload as the traffic is literally shifting as you do this.
You may have noticed the
* in the list above—this is the step at which the TCP connection to the leader is going to be closed. The Kafka client handles this gracefully by default and clients will reconnect quickly to the new leader, but changing some defaults can make that more difficult. As a result, we recommend downloading configs from the tools and client configuration area of our portal.
In Confluent Cloud, this entire process is completely automated and happens behind the scenes. You simply select how many CKUs you want in your cluster and click OK. Make no mistake, the rebalance is still happening, but there is no work for you to do. Confluent Cloud is also designed to minimize disruption while the process is taking place. How long it takes depends on your data volume, both flowing and stored, so it varies by cluster. New capacity comes online quickly, in a rolling process, so you can alleviate cluster pressure almost immediately and we even do some of the operations in parallel to decrease the time necessary for rebalancing. Over time as the rebalancing progresses more and more, capacity is rolled online and you can utilize the full extent of the CKUs you added.
The vision of Confluent Cloud is to take all of these advanced capabilities and deliver them to you in a seamless experience as an event streaming software as a service (SaaS), rather than merely hosted Kafka. Give Confluent Cloud a try yourself to see the difference. Use the promo code CL60BLOG to get an additional $60 of free Confluent Cloud usage.*
Over the next few months, we’ll continue outlining other ways we’re bringing event streaming SaaS to the market, but it is worth adding that the same innovation and technology we develop in Confluent Cloud also goes into Confluent Platform, providing the same streamlined and automated expansion experience in your own datacenter as well as a cloud-like SaaS experience in the cloud, on premises, or in hybrid configurations.
If you haven’t already, check out the demo to see Apache Kafka scale to 10 GBps in Confluent Cloud with only seven clicks. You can also check out future demos on the Elastic page dedicated to this announcement.