
Tackling the Hidden and Unhidden Costs of Kafka


Confluent Cloud’s Path to Cost-Efficient Data Streaming across 30K+ Clusters

In the first two parts of our blog series, we analyzed the costs of operating Apache Kafka® across infrastructure and development & operations. In particular, we touched upon how low utilization and complex cluster operations are unfortunate realities that infrastructure teams must grapple with as they balance performance and reliability with cost efficiency. These are topics we’ve had to address ourselves in building our own Confluent Cloud service, especially as we looked to establish a viable cost structure while being able to confidently and efficiently scale our service to tens of thousands of clusters (and beyond).

For self-managed Kafka users, the risk of service degradation creates a bias toward overprovisioning resources to handle spikes or growth in demand, regardless of how long they last. For example, streaming workloads typically fluctuate in throughput during the day, often leading to low utilization outside of business hours and wasted spend. That's before accounting for the engineering staff required to manage, maintain, and scale these clusters. With modern cloud-native architectures and strategies, we aimed to change this mindset and improve utilization and efficiency without compromising reliability.

While low utilization is a common reality for Kafka engineering teams, when speaking with prospective customers we often see cost comparisons based on theoretical benchmarks that assume unattainable utilization approaching 100%. In our experience, these comparisons are not only unrealistic (as mentioned in Part two, average utilization for self-managed clusters is just ~25%), but more importantly, they lead to wrong decisions, poorly designed systems, and higher costs down the road. For example, teams underestimate development & operations costs, or focus their efforts on optimizing compute costs when networking comprises over 50% of the infrastructure bill as they scale.

In this blog post, we’ll cover how we tackled these utilization and operational challenges in Confluent Cloud, resulting in a 3x improvement to resource efficiency, and how users can leverage our fully managed and complete data streaming platform to reduce their overall costs.


Our path to cloud-native cost efficiency

Today, Confluent Cloud is the largest multicloud data streaming platform on the planet with over 30 thousand clusters and 3+ trillion messages processed each day. We’ve invested over 5 million engineering hours into making our platform truly cloud-native. Confluent Cloud reduces the cost to run data streaming workloads by achieving efficient resource utilization (up to 3x compared to self-managing on your own), while reducing the complexity for customers to manage and scale their clusters.

However, this didn’t just happen overnight—from the beginning, we knew that our vision to offer the world’s foremost data streaming service across all three cloud providers would be an ambitious undertaking. From our experiences working with our self-managed customers and our roots of running Kafka as an internal service at LinkedIn, we knew that this goal could not be met by simply installing open source Kafka on virtual machines in the cloud.

We needed to build a foundation that could act as the basis for both a sustainable cloud and on-premises business. This required a fundamental redesign of virtually every piece of our stack, from the compute and storage layers to networking to the orchestration and billing tools and infrastructure—all while optimizing our costs so we could viably offer our service to customers at massive scale and across multiple clouds. We’ll share more details about our system internals and our cloud-optimized engine in the coming weeks.

For this blog, we’ll focus on three areas that enable Confluent Cloud to sustain higher resource utilization and reduce operational complexity, resulting in lower costs for customers while maintaining best-in-class reliability.

Let's start by looking at how Confluent Cloud leverages multi-tenancy, serverless abstractions, and elasticity to improve overall service utilization and lower the infrastructure cost of running Kafka.

Multi-tenancy 

Multi-tenancy is the sharing of resources across a group of users, or tenants, to enable higher system utilization, rapid provisioning, and on-demand usage models. Multi-tenancy also underpins other cloud services like AWS S3, Snowflake, and Google BigQuery: as the user base scales, compute density increases and is shared among users, improving overall system efficiency.

Confluent Cloud leverages multi-tenancy for our Basic and Standard clusters, which completely abstract users from sizing, provisioning, and capacity planning, while enabling customers to dynamically scale their throughput up and down on demand (paying only for the throughput they actually use), with up to 99.99% uptime availability. Because Confluent Cloud runs at the scale of tens of thousands of clusters, we combine multi-tenancy with serverless abstractions and elasticity to achieve roughly 3x the resource efficiency of a typical self-managed, single-tenant cluster. This lets customers with unpredictable workloads stop overprovisioning Kafka clusters that end up idle ~75% of the time.

A lot went into making Confluent Cloud multi-tenant that we won’t cover here, but it's important to understand that we strictly isolate tenants and implement guardrails like resource limits, cellular isolation, and quotas to maintain our high quality of service. As mentioned earlier, we’ll share more on Confluent’s cloud-optimized engine in the coming weeks.

Serverless abstractions

Confluent uses serverless abstractions across compute, storage, and networking to continually improve system utilization while saving customers time and complexity. Abstraction enables Confluent Cloud to choose the right infrastructure and optimize performance, cost, and utilization dynamically over time, without the need for customer intervention. 

In the first blog of this series, we showed how networking is often the largest Kafka-related infrastructure cost because data must flow across many cloud networking components and availability zones (AZs). Let’s briefly cover two of the ways Confluent improves network utilization and efficiency for customers: economies of scale and optimized network routing.

Confluent operates at massive scale, which enables strategic partnerships with each cloud provider. These partnerships let us secure lower resource and networking costs for our customers and give us access to exclusive networking features and API integrations, which we use to optimize our networking and routing, improving network utilization while reducing overall network costs for customers.

We are also making improvements to optimize our customers' network paths. For example, follower fetching, mentioned in the first blog of our series, is a strategy to reduce cross-AZ traffic when consuming data from topics. However, the Kafka follower fetching implementation had some key stability gaps, which required additional engineering and hardening for optimal performance and compatibility with Confluent's cloud-optimized engine. Follower fetching is currently available via early access on Confluent Cloud, and we're excited to see customers save additional networking costs with this feature.
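
To make this concrete, here is a minimal sketch of the client side of follower fetching (KIP-392), assuming a hypothetical topic, consumer group, bootstrap address, and AZ name. On self-managed clusters, the brokers must also be configured with broker.rack and a rack-aware replica selector; on Confluent Cloud, the broker side is managed for you.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RackAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "BOOTSTRAP_SERVER:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-readers");                 // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Advertise which AZ this consumer runs in so fetches can be served from the
        // in-sync replica in the same zone, avoiding cross-AZ egress charges.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "use1-az1"); // this client's AZ (placeholder)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            // poll loop omitted
        }
    }
}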

Elasticity

Elasticity means customers can dynamically scale their infrastructure up and down to meet demand, while maintaining performance and reliability. Multi-tenancy and serverless abstractions help with building a highly scalable service, but to make Confluent Cloud truly elastic and enable higher efficiency, we needed to decouple our compute and storage layers. Confluent Cloud does this through our data balancing and cloud-native storage systems, both key components of our cloud-optimized engine.

When self-managing Kafka, scaling workloads means being able to redistribute and rebalance partitions effectively. Cluster balance is crucial for reliability, scaling, and utilization, but it is hard to get right and must be done frequently as workloads change. In contrast, Confluent Cloud continually monitors cluster load factors and periodically rebalances data partitions across our clusters based on forecasted trends to achieve optimal cluster balance and enable elasticity.
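
To illustrate what rebalancing involves when self-managing, here is a hedged sketch that uses the standard Kafka AdminClient to move a single partition onto a new set of brokers; the topic, partition, broker IDs, and bootstrap address are hypothetical. Deciding which partitions to move, batching the moves, and throttling replication traffic is the ongoing operational work that Confluent Cloud's automated balancing takes off your plate.

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ManualRebalance {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "BOOTSTRAP_SERVER:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Move partition 0 of "orders" onto brokers 4, 5, and 6 (hypothetical IDs).
            // Choosing *which* partitions to move, and when, is the hard part.
            Map<TopicPartition, Optional<NewPartitionReassignment>> moves = Map.of(
                    new TopicPartition("orders", 0),
                    Optional.of(new NewPartitionReassignment(List.of(4, 5, 6))));

            admin.alterPartitionReassignments(moves).all().get();
        }
    }
}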

Optimal data balance and elasticity enable Confluent Cloud to highly utilize cloud infrastructure across both our Multi-tenant and Dedicated cluster offerings, all while providing the serverless benefits discussed above. The following chart shows how recent improvements to Confluent Cloud’s balancing algorithm reduce p99 cluster latency metrics by ~50% for a heavily utilized cluster. By reducing latency skew, we’re able to further increase the utilization ceiling for customers while maintaining reliability and performance.

Comparing latency skew across brokers pre- and post-balancing algorithm updates

The second element that enables enhanced elasticity is decoupling storage from compute using tiered storage. Tiered storage enables our clusters to offload older partition segments from local disks to more cost-effective cloud object stores. The entire tiering logic is completely transparent, and older event data can still be consumed through the standard Kafka APIs on Confluent Cloud, unlocking long-term retention use cases without the need for large underutilized, storage-bound clusters.

The following chart shows how Confluent Cloud reduces compute resources by up to 75%, versus a self-managed cluster, by scaling storage without needing to scale compute. For more details about Confluent’s storage strategy, check out our 10x Storage Blog.
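
To show how transparent this is to applications, here is a small sketch (topic name, sizing, and bootstrap address are placeholders) that creates a long-retention topic with the standard AdminClient API. On Confluent Cloud, older segments of a topic like this are tiered to object storage behind the scenes, so retention can grow without adding brokers or disks.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class LongRetentionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "BOOTSTRAP_SERVER:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // A topic retaining 90 days of events; producers and consumers use the
            // standard Kafka APIs regardless of which tier a segment lives in.
            NewTopic topic = new NewTopic("orders-history", 6, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(90L * 24 * 60 * 60 * 1000)));

            admin.createTopics(List.of(topic)).all().get();
        }
    }
}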

Our cloud-optimized engine also powers our Dedicated clusters, for customers that require unshared infrastructure for their workloads. To provide elasticity, we give customers the ability to programmatically scale their clusters, both up and down, to fit their demand profile using our API, UI, or CLI. In fact, you can resize a Dedicated cluster with a single Confluent CLI command:

confluent kafka cluster update lkc-abc123 --cku 3

The following chart shows how a Dedicated cluster using programmatic scaling achieves higher utilization (up to 90% in this simple example) compared to provisioning the cluster for peak throughput demand. The shaded purple section in the chart represents unused resources. Programmatic scaling, in conjunction with other Dedicated cluster features like Client Quotas, enables customers to implement shared services and multi-tenant models to further optimize resource utilization on their own.
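
As a point of reference, the open source analogue of this idea is Kafka's client quota API. The sketch below, with a made-up client ID and byte rates, caps one internal tenant's produce and consume throughput through the AdminClient; Confluent Cloud's Client Quotas feature applies similar limits at the service account level without you having to operate any of it.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class TenantQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "BOOTSTRAP_SERVER:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Cap a single internal tenant (identified by client.id) at roughly
            // 10 MB/s produce and 20 MB/s consume so one team cannot starve the others.
            ClientQuotaEntity tenant = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.CLIENT_ID, "team-payments")); // hypothetical tenant

            ClientQuotaAlteration alteration = new ClientQuotaAlteration(tenant, List.of(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 10_000_000.0),
                    new ClientQuotaAlteration.Op("consumer_byte_rate", 20_000_000.0)));

            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}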

Reducing development & operations costs

As discussed in our second blog of the series, the development, operational, and ongoing maintenance costs associated with self-managing Kafka can be significant. Oftentimes, keeping Kafka performant and available creates additional costs—for example, running extra Kafka brokers or having multiple on-call engineers, which results in underutilized cloud resources and engineering staff. Confluent Cloud removes this burden by fully managing the entire data streaming stack to reduce complexity and increase availability for our customers.

With almost a decade of helping customers run mission-critical data streaming workloads, we understand every aspect of running Kafka at scale. Confluent Cloud was designed to address these complexities and enable customers to focus on building great streaming applications rather than operating the underlying Kafka infrastructure. The following graphic shows how Confluent Cloud provides a fully managed and serverless experience across the entire stack compared to self-managing Kafka or using other vendors.

A fully managed service must also be reliable—to reduce the probability and costs associated with downtime, Confluent Cloud uses real-time monitoring and automated health checks deployed across our entire fleet to evaluate the health of each cluster. We’re able to detect failure scenarios and trigger automatic mitigation to prevent drops in availability. For data integrity, we perform automatic durability audits that check our data and metadata states for correctness across trillions of messages each day. This helps us catch durability lapses before they reach customers, for example avoiding data loss due to replica divergence or log start offset race conditions.

We understand the importance of making it easy and efficient for engineering teams to operate mission-critical software, because we've had to do it ourselves with our own Confluent Cloud service. We've heavily invested in our internal systems and automation to provide a scalable and reliable experience for our SREs. To put our work into perspective, we typically have fewer than five on-call engineers managing and responding to Confluent Cloud issues at any given time across over 30 thousand clusters. This often comes as a surprise to our self-managed users, who typically have at least one on-call engineer maintaining just a handful of clusters.

All of these investments create a significantly more reliable and accessible service that removes the complexities of operating Kafka, enabling your engineering teams to be better utilized and move up the stack to more strategic projects. Confluent Cloud offers a 99.99% uptime SLA because of the deep investments in software-driven operations and automation we’ve made across every layer of the stack to maximize availability and performance without sacrificing cost efficiency. Customers choose Confluent because it reduces the additional effort and cost required to achieve the same level of reliability vs. self-managing Kafka on their own in-house.

A complete data streaming platform

Part two in our series covered how Kafka is a foundational element for implementing any data streaming use case, but not sufficient by itself—especially as you scale. Connectors, security and governance features, stream processing, and geo-replication tools all have to be built and integrated with Kafka. Naturally, this comes with significant cost in the form of engineering time and resources spent developing and maintaining infrastructure tooling rather than the critical streaming applications and pipelines the clusters are meant to support.

We’ve invested heavily in building a complete data streaming platform with a full suite of features to help teams quickly and securely implement streaming use cases end-to-end (and remove the costs associated with ongoing maintenance). Confluent provides tools like 120+ pre-built connectors, a portfolio of stream processors, data governance capabilities specifically designed for streaming data, and more. In addition to reducing your cost structure, these features accelerate the development cycles required to move your streaming use cases into production.

We also understand that ongoing enterprise support is critical. There are many interconnected components of a data streaming architecture, and we work with thousands of customers every day to ensure their success across Kafka clients, network topologies, and all the services that interact with Confluent. Personalized support from our data streaming experts is among the many reasons our customers—from new startups to the largest banks in the world—trust Confluent with their mission-critical workloads and data.

So what’s the most cost-effective data streaming solution?

As we’ve covered throughout our blog series, evaluating the costs of running Kafka requires a more holistic approach than just looking at the cost of the software and comparing theoretical benchmarks. You must also consider the costs of building a complete platform and how (in)efficiently your infrastructure resources and engineers will be utilized to operate and maintain it going forward.

Confluent Cloud offers a fully managed, cloud-native experience that is fundamentally more cost-effective because of the efficiencies and completeness of our platform. Architecting our service to support multi-tenancy, decouple infrastructure layers, enhance elasticity, and abstract away day-to-day management has enabled thousands of customers to optimize their spend on data streaming—and focus on their key initiatives instead of running and maintaining the underlying Kafka infrastructure.

If you’d like to see how your organization can achieve similar cost savings, be sure to check out our cost calculator. To experience our cloud-native service, sign up for Confluent Cloud today!

  • Chase Thomas is a Group Product Manager at Confluent where he focuses on Apache Kafka. Prior to Confluent he held product roles at Splunk and at AWS Managed Streaming for Apache Kafka (MSK). He started his career building real-time instrumentation systems on dams across California. Chase has an MS-MBA from BYU and the Marriott School of Business. In his free time, you’ll find Chase fishing in the outdoors with his family.

  • Kevin Chao leads product go-to-market strategy as part of the Product Marketing team at Confluent. Prior to Confluent, he worked as a design engineer at Nvidia and Advanced Micro Devices (AMD), before working at McKinsey to help various B2B tech companies on their product and go-to-market strategies.
