
May 22, 2024 | Read Time: 10 min

Serverless Decoded: Reinventing Kafka Scaling with Elastic CKUs

Written By

Julie Price, Senior Product Manager, Confluent
Robin Ong, Senior Product Marketing Manager, Confluent


Apache Kafka® has become the de facto standard for data streaming, used by organizations everywhere to anchor event-driven architectures and power mission-critical real-time applications. However, this rise has also sparked discussions on improving Kafka operations and cost-efficiency—streaming data is naturally prone to bursts and often unpredictable, resulting in inevitable variations in workloads and demand on your Kafka cluster(s). These variations can come in different forms:

Seemingly predictable spikes like a Black Friday surge or business hours peaks
Unpredictable spikes like a post going viral or an instant sellout of a hot ticket item
Intra-day peaks and troughs matching user patterns like ordering food delivery at mealtimes or ride-sharing demand after an event
Rapid growth from adding new data streaming workload(s) or expanding your customer base

Regardless of the source, these variations in demand create challenges as you think about capacity planning, scaling, and cost management—this is where the benefits of serverless clusters kick in.

Sample profile for online food delivery player reveals spikes during mealtimes, resulting in a ~45:1 peak-to-trough ratio

Above is a sample workload pattern for a food delivery service that demonstrates the high ratio of peak-to-trough throughput that is common in many Kafka use cases. Your data streaming platform must be able to handle this variability in throughput and support bursts and growth seamlessly as they come, without degrading performance or availability. At Confluent, we often ask customers who were self-managing Kafka, “How did you handle the unpredictability in workloads on your clusters?” The common answer is either “we struggled to” or “we over-provisioned,” and neither is a satisfactory solution.

Capacity planning is challenging and results in wasted resources, high ops burden, and increased downtime risk compared to autoscaling, serverless clusters

In this blog, we'll use this online food delivery service demand profile to run through the costs and implications across three options for capacity management:

Resources are pre-provisioned based on anticipated peak demand (plus a buffer based on historical spikes).
Resources are pre-provisioned to match predictable demand during mealtimes and scaled down during non-mealtimes (assumes infrastructure provisioning and all the tasks required to scale a cluster up and down happen quickly enough to make this feasible, which comes with its own operational demands as trade-offs).
Resources instantaneously autoscale based on actual demand (via serverless clusters).

We'll compare the infrastructure, development and operations personnel, and downtime costs across the three scenarios, and see which line items autoscaling, serverless clusters can help with the most.

As we dive in, please also join us for a webinar on June 18 where we'll break down the key cost drivers of running Kafka, along with a product demo around how autoscaling clusters can optimize your time and resources.


Capacity management is tricky
How Enterprise and eCKUs help
Cost savings with autoscaling, serverless clusters
What's next
Appendix

Capacity management is tricky

Self-managing Kafka to meet the demands of fluctuating workloads requires time-consuming capacity planning, cluster sizing, and pre-provisioning of infrastructure.

And it’s important to get capacity planning right. Under-provisioning results in performance degradation, downtime risk, and pager alerts, all of which are unacceptable for mission-critical workloads and often lead to lost revenue. Over-provisioning leads to wasted spend on idle resources, with overall compute utilization in the single to low double digits.

Source: Jack Vanlightly

Attempting to match capacity at all times to dynamically changing workloads is virtually impossible. Why is it so difficult?

Scaling clusters up is a cumbersome process. New infrastructure resources must be provisioned, which takes time. Even in the cloud, you are subject to delays such as the time to spin up EC2 instances (hint: it’s not instant). Then come the even more demanding tasks of configuring each additional broker and rebalancing partitions to evenly distribute the workload for optimal performance. These pain points are where the benefits of serverless really shine: there are literally no brokers or instances to provision, but more on that in a bit.

Scaling down also requires similar management actions, but in reverse. Infrastructure resources need to be spun down to avoid paying unnecessary costs, brokers must be removed, and partitions must be rebalanced. For teams self-managing Kafka, this responsibility lies solely with you—and although these steps can be scripted, they are often done manually, which is time-consuming and prone to human error.
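To give a sense of what scripting the rebalancing step involves, here is a minimal sketch that builds a partition reassignment plan in the JSON layout read by Kafka's `kafka-reassign-partitions` tool. The topic name, broker IDs, and the naive round-robin placement are assumptions made for illustration only; a real plan also has to account for rack awareness, the volume of data being moved, and replication throttling.

```python
import json


def round_robin_plan(topic, num_partitions, broker_ids, replication_factor=3):
    """Spread each partition's replicas across brokers in round-robin order.

    Illustrative only: this ignores racks, current placement, and data movement.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Start each partition on a different broker, then take the next ones in order.
        replicas = [
            broker_ids[(p + r) % len(broker_ids)]
            for r in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical example: a 6-partition "orders" topic after a fourth broker is added.
plan = round_robin_plan("orders", num_partitions=6, broker_ids=[1, 2, 3, 4])
print(json.dumps(plan, indent=2))
```

Generating the plan is the easy part. Executing it, throttling it, verifying completion, and repeating the whole exercise in reverse when scaling down is where the operational effort goes.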

Operating Kafka on your own can be difficult—and it gets more challenging as you scale

Scaling infrastructure solely in anticipation of predicted spikes, like Black Friday traffic, may seem reasonable. However, this approach still demands upfront planning and provisioning of extra resources, often resulting in prolonged scaling pre-/post-spike and ongoing spend for excess capacity. Similarly, managing daily peaks during regular business hours by scaling up and down becomes impractical due to the effort involved. And this still doesn’t solve for the unplanned spikes in demand. Consequently, clusters are frequently provisioned for peak usage, leaving substantial unused capacity—and additional costs—for your fleet.

How Enterprise and eCKUs help

The recently introduced Enterprise clusters combine the unique benefits of elastic autoscaling with private networking so you can minimize operational burden while upholding strict security requirements. Along with all the benefits you expect from Confluent Cloud—99.99% uptime SLA, infinite storage, built-in resiliency, and a full ecosystem of enterprise-grade tools—Enterprise clusters provide a truly serverless experience, instantaneously autoscaling up to meet spikes in demand and back down during periods of low usage without reserving or pre-provisioning capacity.

The difference between a fully managed SaaS-like (serverless) experience vs. IaaS or PaaS

This means that Enterprise clusters are always right-sized for your workloads without user intervention. No more rigorous capacity planning, upfront sizing, configuring brokers, waiting for hardware to provision, managing partition balance, or paying for idle resources. Enterprise clusters automatically react to the workload and instantaneously right-size as your workload grows and shrinks.

What enables this? Powering these autoscaling capabilities are what we call Elastic Confluent Units for Kafka (eCKUs), enabled by our cloud-native Kora engine. Each eCKU represents a collection of capacity across multiple dimensions to support your workloads, including throughput, partitions, and client connections. The magic of eCKUs is that they are automatically and instantly allocated to your workload as they’re needed, only when they’re needed.

What does this mean, exactly? Let’s say your workload requirements across these dimensions fit within the capacity of 2 eCKUs when running in steady state. You add some new consumer groups, increasing the egress throughput over the total limit for 2 eCKUs. What will happen? If you were self-managing open source Kafka, you’d be hit with performance degradation and/or downtime. With Enterprise clusters, your workload is completely unimpacted. We automatically allocate any additional eCKUs needed and bill you for them as your workload scales. Once your workload cools off, the eCKUs are automatically deallocated and you will no longer be billed for them. It’s really that simple—no manual intervention is required.
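One way to picture the example above is as a calculation over the busiest dimension. The sketch below is our own simplification, not a description of how Kora allocates capacity: the per-eCKU limits are hypothetical placeholder numbers, and the "largest ratio wins" rule is an assumption made for illustration.

```python
import math
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Load:
    ingress_mbps: float
    egress_mbps: float
    partitions: int
    connections: int


# HYPOTHETICAL per-eCKU limits, chosen only to make the arithmetic easy to follow.
PER_ECKU = Load(ingress_mbps=50, egress_mbps=150, partitions=3000, connections=10000)


def eckus_needed(load, per_unit=PER_ECKU):
    """Return the smallest unit count that covers every dimension (assumed rule)."""
    ratios = [getattr(load, f.name) / getattr(per_unit, f.name) for f in fields(load)]
    return max(1, math.ceil(max(ratios)))


steady = Load(ingress_mbps=80, egress_mbps=250, partitions=2000, connections=5000)
# New consumer groups raise egress while every other dimension stays the same.
with_new_consumers = Load(ingress_mbps=80, egress_mbps=400, partitions=2000, connections=5000)

print(eckus_needed(steady))              # 2
print(eckus_needed(with_new_consumers))  # 3
```

With these placeholder limits, the extra egress alone pushes the workload from 2 to 3 units, and it drops back to 2 once egress returns to its steady-state level.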

Autoscaling Enterprise clusters are always right-sized to handle both predictable and unpredictable demand without manual sizing or provisioning

Autoscaling Enterprise clusters seamlessly handle bursts in workloads as they occur, without the need to overprovision to handle peak usage. This applies to both predictable and unpredictable demand. You no longer need to pay for unused capacity sitting idle, waiting for a potential spike, nor do you have to worry about under-provisioning resulting in poor latency or cluster downtime. No more wasted resources and no more pager duty calls – call it Enterprise’s win-win.

Cost savings with autoscaling, serverless clusters

Now let's take a look at how this translates into tangible savings by comparing infrastructure, development and operations personnel, and downtime costs across the three capacity management scenarios introduced above:

Scenario 1: Peak (provisioning for peak demand, including the unforeseen).
Scenario 2: Predictable Demand (infrastructure resources are scaled up to accommodate expected mealtime peaks and scaled back down during low demand periods).
Scenario 3: Autoscaling (serverless clusters autoscale to demand).

Note: For this workload, the predictable peak is ~650 MBps, the low is ~15 MBps, and the unpredictable peak due to unforeseen demand is 1,000 MBps. The resulting average over the time period is ~300 MBps.

Infrastructure

Before we dive in, we should note that we are focusing only on compute costs for the sake of simplicity, given networking and storage costs do not vary much across the three scenarios. In a prior blog, we outlined how networking, specifically cross-AZ traffic, can represent the majority of Kafka infrastructure costs, especially as throughput grows. These networking costs are based directly on the data transferred and do not vary based on how much capacity is provisioned. Similarly, with Tiered Storage (KIP-405) becoming available soon in open source Apache Kafka, we assume storage costs are equally optimized across the three scenarios.

To estimate compute costs, you take the number of nodes and multiply by the cost of running the machine type for each of those nodes. Determining the right number of brokers and optimizing your machine type can be challenging. We've chosen sample node counts and machine types here based on our experiences to support the above workload across each scenario.¹

Scenario 1: Peak – 1,250 MBps²

Kafka Component	Nodes	Machine Type	Hourly Rate	Annual Cost
Brokers	63	m5.xlarge	$0.192	$105,961
Quorum Controllers	5	m5.large	$0.096	$4,205
Total				$110,166

Scenario 2: Predictable Demand – Up to 760 MBps and Down to 220 MBps²

Kafka Component	Nodes	Machine Type	Hourly Rate	% of Time	Annual Cost
Brokers (Scale Up)	39	m5.xlarge	$0.192	67%	$43,730
Brokers (Scale Down)	12	m5.xlarge	$0.192	33%	$6,728
Quorum Controllers	5	m5.large	$0.096	100%	$4,205
Total					$54,662

Scenario 3: Autoscaling – Average 300 MBps³

Kafka Component	Nodes	Machine Type	Hourly Rate	Annual Cost
Brokers	15	m5.xlarge	$0.192	$25,229
Quorum Controllers	3	m5.large	$0.096	$2,523
Total				$27,752
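The totals in the three tables can be reproduced with the formula from the appendix (nodes × hourly rate × 8,760 hours). The sketch below assumes that the 67% and 33% time shares in Scenario 2 are rounded values of two-thirds and one-third, which is what matches the published figures.

```python
HOURS_PER_YEAR = 8_760

M5_XLARGE = 0.192  # hourly rate from the tables above
M5_LARGE = 0.096


def annual_cost(nodes, hourly_rate, share_of_time=1.0):
    """Annual compute cost for a group of identical nodes."""
    return nodes * hourly_rate * HOURS_PER_YEAR * share_of_time


# Each row: (nodes, hourly rate, share of the year the nodes are running)
scenarios = {
    "Scenario 1: Peak": [(63, M5_XLARGE, 1.0), (5, M5_LARGE, 1.0)],
    # Assumption: 67% / 33% in the table are rounded from 2/3 and 1/3.
    "Scenario 2: Predictable Demand": [
        (39, M5_XLARGE, 2 / 3),
        (12, M5_XLARGE, 1 / 3),
        (5, M5_LARGE, 1.0),
    ],
    "Scenario 3: Autoscaling": [(15, M5_XLARGE, 1.0), (3, M5_LARGE, 1.0)],
}

totals = {
    name: sum(annual_cost(*row) for row in rows) for name, rows in scenarios.items()
}
for name, total in totals.items():
    print(f"{name}: ${total:,.0f}")
```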

While the actual required compute cost of the workload should in theory be equivalent to the compute cost in Scenario 3: Autoscaling, we assume 85% utilization in this scenario to allow for some buffer—given 100% utilization is unrealistic for performance and reliability reasons in practice. Therefore, the required compute cost for this workload is assumed to be $23,589 (85% of $27,752).

Applying the required compute cost of $23,589 to the overall cost in Scenario 1: Peak gives us a resource utilization of just 21%, which is consistent with our experiences serving customers who self-manage Kafka. It should come as no surprise that total compute costs are highest when provisioning for peak, as this results in (over)paying for resources that are highly underutilized.

Resource utilization improves to ~43% in Scenario 2: Predictable Demand, but this comes with additional operational overhead to match cluster capacity to the fluctuating demand of your workloads. It also opens you up to a higher risk of outages and performance degradation when there are the inevitable unpredictable spikes in demand. As with most engineering challenges, there is no free lunch. We'll talk more about both of these in the next sections.

With Scenario 3: Autoscaling, our serverless Enterprise clusters are always right-sized for your workload, and you only pay for the resources you use when you actually need them. This saves you ~50% on compute costs compared to Scenario 2: Predictable Demand, and ~75% compared to Scenario 1: Peak.
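The utilization and savings percentages quoted in this section follow directly from the table totals. Here is a short sketch of the arithmetic, using the rounded totals from the tables and the 85% utilization assumption stated above; the function names are ours.

```python
# Annual compute totals from the three scenario tables (USD).
peak_total = 110_166
predictable_total = 54_662
autoscaling_total = 27_752

TARGET_UTILIZATION = 0.85  # buffer assumed for the autoscaling scenario
required = autoscaling_total * TARGET_UTILIZATION  # compute the workload actually needs


def utilization(provisioned_cost):
    """Share of provisioned compute spend that the workload actually needs."""
    return required / provisioned_cost


def savings(baseline_cost, alternative_cost):
    """Fractional cost reduction when moving from baseline to alternative."""
    return 1 - alternative_cost / baseline_cost


print(f"Required compute: ${required:,.0f}")
print(f"Utilization, Scenario 1: {utilization(peak_total):.0%}")
print(f"Utilization, Scenario 2: {utilization(predictable_total):.0%}")
print(f"Savings vs. Scenario 2: {savings(predictable_total, autoscaling_total):.0%}")
print(f"Savings vs. Scenario 1: {savings(peak_total, autoscaling_total):.0%}")
```

The exact results are roughly 49% and 75%, which the text rounds to ~50% and ~75%.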

Development and operations personnel

Estimating total development and operations costs can be tricky as the amount of engineering time and resources spent on Kafka varies by environment, workload, scale, and organizational structure. Moreover, “Kafka” doesn’t appear as a line item on your bill or timecard, so how easy it is to quantify depends on how closely engineering time and activities are tracked. As discussed in our blog around the (hidden) cost of Kafka operations, we typically see the equivalent of two engineers responsible for cluster development and operations for a smaller production use case. At larger scales, we have seen streaming teams consist of at least seven to 10 engineers.

Note: Talent.com estimates the salary for a Kafka engineer in the United States to be ~$140K annually. Factoring in benefits, bonuses, and stock packages, the fully loaded cost of employment is about 1.25 to 1.4 times salary alone (we have assumed 1.25 in this calculation).

Given this is a larger workload for a business-critical use case (food orders for an online delivery player), we’ve assumed it takes the equivalent of four engineers who are responsible for ongoing Kafka development and management in Scenario 1: Peak. As infrastructure is provisioned for peak (the most common solution we hear from self-managing users), somewhat less engineering time is needed for continuous capacity management and scaling.

In contrast, in Scenario 2: Predictable Demand where compute utilization is (in theory) better optimized, there is the trade-off of taking on additional operational burden associated with scaling clusters up during mealtimes and back down during non-mealtimes to meet ever-changing demands (i.e., no free lunch, pun intended). Even sophisticated Kafka teams at LinkedIn and Lyft have documented much of the pain associated with scaling their Kafka clusters, running into issues around cascading controller failures, slow broker startup and shutdown times, or frequent instance failures. In our analysis, we've assumed this requires ~1.5x more engineering resources (or 6 FTEs total) compared to Scenario 1: Peak.

In Scenario 3: Autoscaling, capacity management and cluster scaling responsibilities are completely offloaded to Confluent as Enterprise clusters are serverless and autoscale to demand, which can save over 50% in engineering resources allocated to Kafka management. This also boosts development velocity by freeing up valuable engineering resources for business logic and innovation, rather than managing the underlying infrastructure.
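For a rough sense of scale, the personnel assumptions above translate into annual costs as sketched below. Only Scenarios 1 and 2 are computed, because the post gives an explicit headcount for those two; for Scenario 3 it states only a saving of "over 50%," so we do not guess at a figure.

```python
BASE_SALARY = 140_000       # Talent.com estimate cited above (USD per year)
FULLY_LOADED_FACTOR = 1.25  # low end of the 1.25 to 1.4 range, as assumed in the post

cost_per_engineer = BASE_SALARY * FULLY_LOADED_FACTOR

# Full-time-equivalent engineers assumed in the text.
ftes = {
    "Scenario 1: Peak": 4,
    "Scenario 2: Predictable Demand": 6,  # ~1.5x Scenario 1
}

personnel_cost = {name: count * cost_per_engineer for name, count in ftes.items()}
for name, cost in personnel_cost.items():
    print(f"{name}: ${cost:,.0f} per year")
```

Even at the conservative end of the loading factor, personnel costs are several times larger than the compute costs in the previous section.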

Downtime

While downtime is not a "guaranteed" cost, when an incident does occur, the impacts to your business (and your engineering teams) are very real in the form of lost revenue, lost data, customer dissatisfaction, SLA penalties, and more. Regardless of whether an individual broker fails or an entire cluster goes down, valuable engineering resources are diverted to troubleshoot issues instead of focusing on other more strategic projects. Therefore, the best-case scenario in these situations is that your “only” costs are the engineering resources paged to address issues such as instance and network partition failures, whereas the worst case is that business-critical, revenue-generating applications are negatively impacted.

Note: We have used a conservative cost of downtime estimate of $1,000 per hour. Many companies have found that a single hour of downtime can cost upwards of $100,000—this amount will vary depending on your use case(s).

In Scenario 1: Peak, we've assumed a baseline SLA of 99.5% (based on, for example, brokers running on EC2 with a 99.5% instance-level SLA), which translates to ~44 hours of downtime per year.

The risk of downtime goes up in Scenario 2: Predictable Demand as unpredictable spikes in demand are more likely to exceed the amount of resources provisioned, resulting in performance issues and/or cluster downtime. In our analysis, we've assumed this risk to be ~2x greater than Scenario 1: Peak. This translates to ~88 hours of downtime per year—equivalent to a 99% uptime SLA—which is still fairly generous for a self-managed Kafka deployment running at this scale.

Enterprise clusters come with an industry-leading 99.99% uptime SLA, which is equivalent to a risk of downtime of less than 1 hour per year. This is a ~50x reduction in downtime from Scenario 1: Peak, and a ~100x reduction from Scenario 2: Predictable Demand. What's more, this uptime SLA covers not only the underlying infrastructure, but also Kafka performance, critical patches and bug fixes, security updates, and more. You get reduced downtime, and your engineering teams get priceless peace of mind, free from pager duty fires.
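The downtime figures in this section come from converting each uptime percentage into hours per year. The sketch below shows that conversion and applies the conservative $1,000 per hour estimate from the note above; treating the SLA percentage as expected uptime is the post's simplification, and the function name is ours.

```python
HOURS_PER_YEAR = 8_760
COST_PER_HOUR = 1_000  # conservative downtime cost from the note above (USD)


def downtime_hours(uptime_percent):
    """Hours per year outside the given uptime percentage."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


uptime = {
    "Scenario 1: Peak": 99.5,
    "Scenario 2: Predictable Demand": 99.0,
    "Scenario 3: Autoscaling": 99.99,
}

hours = {name: downtime_hours(pct) for name, pct in uptime.items()}
for name, h in hours.items():
    print(f"{name}: {h:.2f} hours/year, about ${h * COST_PER_HOUR:,.0f}")

baseline = hours["Scenario 3: Autoscaling"]
print(f"Scenario 1 vs. 3: {hours['Scenario 1: Peak'] / baseline:.0f}x")
print(f"Scenario 2 vs. 3: {hours['Scenario 2: Predictable Demand'] / baseline:.0f}x")
```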

What's next

As you can see, autoscaling clusters can save you ~75% in infrastructure costs and over 50% in engineering time and resources, and lower your risk of downtime by up to ~100x. As the demand for Kafka grows in your organization, these savings can be upwards of $750,000 per year, as demonstrated in the above comparison. Regardless of your demand patterns, fully managed, serverless Enterprise clusters autoscale to your needs, providing you the most cost-effective solution with no effort required. This is particularly advantageous for streaming where demand is often unpredictable and spiky by nature.

We are continuing to innovate and add improvements to Enterprise clusters. In just a few months since these clusters were launched, we've already made them even more cost-efficient. Available on both AWS and Azure, Enterprise clusters now have a 50% lower entry point (at 1 eCKU) and 55% lower ingress costs, making it easier and more affordable than ever to get started. Throughput limits have also been increased by 20% to provide more capacity per eCKU.

Continued innovation in our Kora engine will allow us to pass along more benefits and cost savings to our customers over time. Join us for a webinar with a full demo of these clusters along with Q&A on June 18, where we’ll also help you understand and demystify your Kafka costs. Register for the webinar today!


Appendix

¹ Annual compute cost = Number of nodes * Hourly cost of machine type * 8,760 hours per year
² Scenario 1: Peak and Scenario 2: Predictable Demand both assume ~15-20% headroom in throughput when provisioning, given: (a) running at or near 100% utilization is unrealistic due to performance and throttling concerns, and (b) when provisioning, users don’t actually know what the precise peak (predictable or unpredictable) will be.
³ Scenario 3: Autoscaling node counts have been included for illustrative purposes only to compare costs across scenarios. In practice, Enterprise clusters are serverless, i.e., have no brokers/instances to manage.

Julie is a senior product manager at Confluent, where she is focused on Kora scalability and elasticity. She has spent many years in the data space as a product manager and data engineer, helping customers across many industries, from gaming to high tech to retail, architect and implement innovative data applications using technologies like Google BigQuery, Dataflow, SingleStore, and SAP HANA.
Robin is a senior product marketing manager at Confluent, where she is responsible for product pricing, packaging, go-to-market strategy, and analytics. Prior to Confluent, she worked at VMware where she focused on pricing and packaging and go-to-market initiatives for its infrastructure, storage, and disaster recovery offerings.