Optimizing costs and increasing efficiencies are on everyone’s mind right now. Interest rates are up, inflation is soaring, and a recession is looming. As a result, this might be the first time in many years that you’ve had to rationalize and justify your cloud costs. Your Apache Kafka®-related spend is likely no exception.
At Confluent, we’ve worked with thousands of customers, across both our fully managed cloud service and self-managed software, to help them understand and right-size their Kafka-related spend to support all of their data streaming workloads. This experience, along with managing our own costs while operating tens of thousands of Kafka clusters in Confluent Cloud, has given us a deep understanding of the key cost drivers and optimization levers for Kafka.
Unfortunately, properly assessing the cost of running Kafka can be a difficult task. We often see analyses that focus solely on the compute and storage resources required to run the platform, but you may find it surprising that networking is actually Kafka’s biggest infrastructure cost and is too often ignored. Moreover, these analyses tend to overlook all the development and operations personnel costs that are harder to quantify, but are very real to your team.
That’s why we’re excited to kick off a four-part blog series where we’ll help you understand and optimize the costs of running Kafka. The first two blogs will focus on how to assess the key drivers of Kafka costs: (a) infrastructure and (b) development & operations resources. We’ll then shift to how Confluent has rearchitected Kafka to be a lot more efficient, before closing with a few thoughts from our founder.
In this first blog, we’re going to run through the infrastructure costs of running Kafka—i.e., compute, storage, networking, and the additional tooling you need to keep Kafka up and running smoothly. We won’t bury the lede—if you’re running Kafka in the cloud across multiple AZs (as most do for high availability), networking likely represents over 50% of your Kafka infrastructure costs. Let’s see how this ends up being the case.
The above chart shows the breakdown of infrastructure costs for both a smaller (20/60 MBps, ingress/egress) and a larger workload (100/300 MBps, ingress/egress) for comparison purposes. It assumes the workload is running on AWS across three availability zones, and data is retained for seven days. We’re setting aside utilization—but more on that later. It also ignores discounting, but keep in mind that discounts typically scale as you spend more with the cloud provider—they do for Confluent Cloud.
Before we get into the details of the analysis, we should clarify upfront that we are omitting dozens of infrastructure components for the sake of simplicity. This includes things like load balancers, NAT gateways, Kubernetes clusters, monitoring solutions, and so on. However, these items are necessary for a production-ready Kafka environment, and they will only add to your infrastructure-related bills. From our experience, they can increase the infrastructure costs of self-managing Kafka by up to 25%. This analysis also calculates the costs of only a single cluster, but you’ll probably have several across environments.
Therefore, this analysis is likely underestimating what you may end up spending to run and support Kafka on your own.
Note: For those interested in doing their own cost analysis, the formulas behind our calculations are listed below in the Appendix.
Let’s start with compute, which is usually the first place people look for savings despite only representing a small portion of infrastructure costs. This instinct stems from the pre-cloud world where compute infrastructure in general was hard to scale and was often coupled with storage.
Simple enough, right? Of course, the real world is much more complicated. Finding the right number of brokers and optimizing your machine type for each workload can be quite challenging—we’ve chosen sample node counts and machine types here for illustrative purposes only.
We’re also ignoring a few Kafka components that are mission-critical for most use cases, such as Kafka Connect and Kafka Streams, which have their own infrastructure costs.
Storage is a bit more involved to properly calculate, because the pricing structure takes into consideration things like IOPS and throughput. In an effort to simplify things, we’re going to ignore those line items and just focus on local EBS storage costs. If you scale your clusters vertically or start to get into serious throughput loads, however, you will need to look carefully at extra IOPS and throughput costs.
Storage needs (GB)
In Confluent Cloud, we have been running our cloud-first storage engine, which leverages tiered storage, for years now. We audit every message for data integrity across the storage tiers. For a deep dive into Confluent’s durability auditing service, which proactively detects data integrity issues on trillions of Kafka messages per day, check out this blog post.
One other thing to note on storage—using instance-based volumes is questionable when running Kafka or a Kafka clone on Kubernetes. We have seen reliability and data loss issues when attempting to use instance-based volumes, so we recommend against it.
To this point, we’ve set aside the need to overprovision resources to account for variability in your workload(s). However, to ensure reliability and performance in the real world, compute resources need to be overprovisioned to protect against unexpected throughput surges, while storage resources need to be overprovisioned to avoid running out of disk space.
An unfortunate (but necessary) reality is that most self-managed infrastructure runs at very low utilization (sometimes below 20%), but those resources end up on your cloud bill regardless of their actual usage. That is driven in part by the difficulty of dynamically scaling resources up and down with the current workload without jeopardizing throughput and latency performance or risking cluster downtime.
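To make the utilization point concrete, here is a minimal sketch of the arithmetic. The function names and the dollar figure are hypothetical and are not taken from our analysis; the only input borrowed from the text is the 20% utilization figure.

```python
def overprovisioning_multiplier(utilization: float) -> float:
    """How many times more you pay per unit of capacity you actually use."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return 1 / utilization


def idle_spend(monthly_bill: float, utilization: float) -> float:
    """Portion of the monthly bill that pays for capacity sitting idle."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return monthly_bill * (1 - utilization)


# Hypothetical bill, chosen only to illustrate the calculation.
EXAMPLE_MONTHLY_BILL = 10_000.0

print(overprovisioning_multiplier(0.20))
print(idle_spend(EXAMPLE_MONTHLY_BILL, 0.20))
```

At 20% utilization, every unit of capacity you actually use costs five times its list price, which is why utilization belongs in any cost model alongside the raw resource prices.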
Optimizing compute and storage utilization is crucial to managing your infrastructure costs. A lot of the engineering work we’ve done at Confluent is focused on helping our users improve their infrastructure utilization rate, such as:
Serverless clusters that let you autoscale
Tiered Storage to elastically scale storage and compute resources independently
Automated partition balancing to optimize broker performance and utilization
Client quotas for you to support multiple tenants/applications
Resource limits to ensure high availability and scalability
We’ll dive into these items more later on in our blog series.
Okay, here’s the big one. Kafka is a high-throughput, low-latency system, so it shouldn’t be that surprising to discover that networking is the largest line item on the bill. The problem is that it’s a hard line to tease apart. Your cloud provider doesn’t say, “Here are your networking costs for Kafka.” Instead, the networking costs are mixed in with all the other networking usage in your organization.
So what do we do? Well, let’s model it out.
Kafka can have several networking costs, but the one that really sticks out is cross-AZ traffic. Let’s assume we are running a multi-zone (three zones) cluster for availability and resiliency. Running Kafka in a single zone is not advisable—zonal outages inevitably happen, and downtime puts your business (and your nights and weekends) at risk.
There are three drivers of cross-AZ traffic:
Producers – a well-balanced cluster will place partition leaders across three zones, meaning that Kafka producers will write to a leader in another zone roughly two-thirds of the time
Consumers – the same concept holds for consumers; a well-balanced cluster will have Kafka consumers read from a partition leader in another zone roughly two-thirds of the time
Partition replication – assuming a replication factor of three (the default recommendation), leader partitions will need to replicate messages to follower partitions in two separate zones
Kafka Ingress (MBps) | Kafka Egress (MBps) | Cross-AZ traffic (MBps) | Cross-AZ rate (GB)
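The three drivers above can be modeled in a few lines. This is a rough sketch that follows the cross-AZ formula in the Appendix, not a billing tool. It assumes every follower replica sits in a different zone than its leader, a 30-day month, and a per-GB rate that is a placeholder rather than a quoted price. The function names are ours for illustration.

```python
SECONDS_PER_DAY = 86_400
GB_PER_MB = 0.001  # same conversion factor used in the Appendix


def cross_az_mbps(ingress_mbps, egress_mbps, zones=3, replication_factor=3):
    """Cross-AZ throughput in MB/sec for a well-balanced cluster."""
    remote_fraction = (zones - 1) / zones          # two-thirds for 3 zones
    producer = ingress_mbps * remote_fraction      # writes to remote leaders
    consumer = egress_mbps * remote_fraction       # reads from remote leaders
    # Assumption: each of the (RF - 1) followers is in another zone.
    replication = ingress_mbps * (replication_factor - 1)
    return producer + consumer + replication


def monthly_cross_az_cost(mbps, rate_per_gb, days_per_month=30):
    """Monthly cost of a sustained cross-AZ throughput at a per-GB rate."""
    gb_per_month = mbps * SECONDS_PER_DAY * days_per_month * GB_PER_MB
    return gb_per_month * rate_per_gb


# Placeholder rate; substitute your provider's actual cross-AZ price.
EXAMPLE_RATE_PER_GB = 0.02

for name, ingress, egress in [("smaller", 20, 60), ("larger", 100, 300)]:
    mbps = cross_az_mbps(ingress, egress)
    cost = monthly_cross_az_cost(mbps, EXAMPLE_RATE_PER_GB)
    print(f"{name}: {mbps:.1f} MBps cross-AZ, ${cost:,.2f} per month")
```

Note that replication alone contributes twice the ingress rate, and that the consumer term grows with fanout, which is why both show up so prominently on the bill.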
As your throughput grows and fanout increases, you can see how networking comes to dominate your infrastructure costs. Moreover, once Tiered Storage and KIP-405 are available to reduce your storage costs, networking alone can comprise ~90% (!) of your infrastructure costs. That’s why we’ve focused so much on reducing networking costs at Confluent with our disaggregated networking and storage tiers and cloud provider partnerships.
High networking costs are also why it’s so important to enable follower fetching for consumers (KIP-392). You don’t incur any additional network costs if you read from a follower within the same zone rather than a leader in another zone. You only pay the cross-zone replication cost once, regardless of how many consumers read the data, resulting in significant network cost savings.
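On the client side, KIP-392 works by having each consumer declare which zone it runs in through the `client.rack` property. Brokers must also advertise their own zone via `broker.rack` and set `replica.selector.class` to `org.apache.kafka.common.replica.RackAwareReplicaSelector`. Here is a minimal sketch of the consumer properties, expressed as a plain dictionary. The helper name, addresses, group name, and zone ID are hypothetical, and you should confirm that your client library and version honor `client.rack`.

```python
def follower_fetch_consumer_config(bootstrap_servers, group_id, zone_id):
    """Build consumer properties that let brokers pick a same-zone replica.

    zone_id must match the broker.rack value of the brokers in that zone.
    """
    if not zone_id:
        raise ValueError("zone_id is required for follower fetching")
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # KIP-392: tells the broker where this consumer is located.
        "client.rack": zone_id,
    }


# Hypothetical values for illustration only.
config = follower_fetch_consumer_config(
    bootstrap_servers="broker-1.example.internal:9092",
    group_id="orders-reader",
    zone_id="use1-az1",
)
print(config)
```

The zone ID is the piece that is easy to get wrong: if it does not match any `broker.rack` value, the consumer simply keeps reading from the leader and the cross-AZ charges remain.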
Once you have a breakdown of your Kafka costs, attack the biggest line items first. Step one is likely to enable follower fetching (currently in early access in Confluent Cloud) to reduce those cross-AZ charges. You can also reduce the replication factor on non-critical topics to optimize networking costs related to partition replication. Next, test different instance types and pick the one that most efficiently fits your needs. Last, scale your cluster up and down based on workload patterns so you aren’t paying for under-utilized infrastructure.
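To see which networking lever matters most for a given workload, you can run a quick what-if comparison. The sketch below reuses the cross-AZ model from the Appendix under our own simplifying assumptions: followers always sit in a different zone than the leader, and follower fetching removes all consumer cross-AZ traffic, which only holds when every zone has a replica. The function name and scenario labels are illustrative.

```python
def cross_az_mbps(ingress, egress, zones=3, replication_factor=3,
                  follower_fetching=False):
    """Cross-AZ throughput (MB/sec) under the stated assumptions."""
    if follower_fetching and replication_factor < zones:
        # Without a replica in every zone, some consumers still read remotely,
        # so this simple model would understate the traffic.
        raise ValueError("follower fetching model needs a replica per zone")
    remote_fraction = (zones - 1) / zones
    producer = ingress * remote_fraction
    consumer = 0.0 if follower_fetching else egress * remote_fraction
    replication = ingress * (replication_factor - 1)
    return producer + consumer + replication


INGRESS_MBPS, EGRESS_MBPS = 100, 300  # the larger workload from this post

scenarios = {
    "baseline": {},
    "follower fetching": {"follower_fetching": True},
    "replication factor 2": {"replication_factor": 2},
}

baseline = cross_az_mbps(INGRESS_MBPS, EGRESS_MBPS)
for name, options in scenarios.items():
    mbps = cross_az_mbps(INGRESS_MBPS, EGRESS_MBPS, **options)
    saving = 1 - mbps / baseline
    print(f"{name}: {mbps:.1f} MBps cross-AZ ({saving:.0%} below baseline)")
```

For a workload with 3x fanout like this one, removing consumer cross-AZ reads saves more than dropping a replica does, and it does so without giving up any durability.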
Having worked with many customers to compare the costs of self-supporting open source Kafka with both our self-managed software (Confluent Platform) and fully managed cloud service (Confluent Cloud), we think it’s important to note that the most economical path, outside of exceptional cases, is to leverage a fully managed cloud service. Confluent Cloud has a unique, cloud-native architecture that drives down costs, along with the economies of scale that come from operating at our size. In fact, we’ve discovered that many of our customers can cover the cost of Confluent Cloud on infrastructure savings alone, especially as they scale.
We’ll cover our architecture in greater detail in part three of this blog series, including some of the engineering investments we’ve made to reduce the infrastructure costs associated with running Kafka:
Optimizing our cloud networking stack and simplifying follower fetching to reduce networking costs
Optimizing storage placement between local disks and object storage to reduce storage costs
Optimizing client request processing to reduce compute costs
Optimizing infrastructure utilization with elastic scaling, client quotas, resource limits, and more
Negotiating hefty scaled discounts with cloud providers and passing the savings on to our customers
Be sure to check in next week for the second installment of this series that will focus on the other key cost driver for Kafka—development and operations personnel (i.e., engineering time and resources). If you’d like to estimate your Kafka costs or explore how much you can save with Confluent, we encourage you to check out our cost calculator as a first step today!
Monthly storage cost (with Tiered Storage) = Kafka MB ingress / sec * 86,400 secs / day * Retention days * 0.001 GB / MB * (% hotset data stored locally * Replication factor * Cost of GB / month on EBS + (1 - % hotset data stored locally) * Cost of GB / month on S3)
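For those who prefer code to formulas, the tiered storage formula above translates directly into a small function. This is a sketch for your own analysis: the parameter names are ours, and the prices in the example are placeholders rather than quoted EBS or S3 rates.

```python
SECONDS_PER_DAY = 86_400
GB_PER_MB = 0.001


def tiered_storage_monthly_cost(ingress_mbps, retention_days, hotset_fraction,
                                replication_factor, ebs_cost_per_gb_month,
                                s3_cost_per_gb_month):
    """Monthly storage cost when only a hot set of data stays on local disks."""
    if not 0 <= hotset_fraction <= 1:
        raise ValueError("hotset_fraction must be between 0 and 1")
    retained_gb = ingress_mbps * SECONDS_PER_DAY * retention_days * GB_PER_MB
    # Local data is stored once per replica; the object store term in the
    # formula is not multiplied by the replication factor.
    local = hotset_fraction * replication_factor * ebs_cost_per_gb_month
    remote = (1 - hotset_fraction) * s3_cost_per_gb_month
    return retained_gb * (local + remote)


# Smaller workload from this post (20 MBps ingress, seven days of retention).
# The hot set share and both prices are placeholders for illustration.
cost = tiered_storage_monthly_cost(
    ingress_mbps=20,
    retention_days=7,
    hotset_fraction=0.1,
    replication_factor=3,
    ebs_cost_per_gb_month=0.08,
    s3_cost_per_gb_month=0.023,
)
print(f"${cost:,.2f} per month")
```

Setting `hotset_fraction` to 1 reduces the formula to retained data times replication factor times the local disk price, which is a convenient way to compare the tiered and non-tiered cases with the same inputs.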
Cross-AZ throughput (MB / sec) = Producer cross-AZ throughput + Consumer cross-AZ throughput + Partition replication cross-AZ throughput = (Kafka MB ingress / sec * ⅔) + (Kafka MB egress / sec * ⅔) + (Kafka MB ingress / sec * 2)