
A Guide to Mastering Kafka's Infrastructure Costs


Optimizing costs and increasing efficiencies are on everyone’s mind right now. Interest rates are up, inflation is soaring, and a recession is looming. As a result, this might be the first time in many years that you’ve had to rationalize and justify your cloud costs. Your Apache Kafka® related spend is likely no exception.

At Confluent, we’ve worked with thousands of customers—across both our fully managed cloud service and self-managed software—to help them understand and right-size their Kafka-related spend to support all of their data streaming workloads. This experience, along with managing our own costs across operating tens of thousands of Kafka clusters in Confluent Cloud, has provided us with a deep understanding of the key cost drivers and optimization levers for Kafka.

Unfortunately, properly assessing the cost of running Kafka can be a difficult task. We often see analyses that focus solely on the compute and storage resources required to run the platform, but you may find it surprising that networking is actually Kafka’s biggest infrastructure cost and is too often ignored. Moreover, these analyses tend to overlook all the development and operations personnel costs that are harder to quantify, but are very real to your team.

That’s why we’re excited to kick off a four-part blog series where we’ll help you understand and optimize the costs of running Kafka. The first two blogs will focus on how to assess the key drivers of Kafka costs: (a) infrastructure and (b) development & operations resources. We’ll then shift to how Confluent has rearchitected Kafka to be a lot more efficient, before closing with a few thoughts from our founder.

In this first blog, we’re going to run through the infrastructure costs of running Kafka—i.e., compute, storage, networking, and the additional tooling you need to keep Kafka up and running smoothly. We won’t bury the lede—if you’re running Kafka in the cloud across multiple AZs (as most do for high availability), networking likely represents over 50% of your Kafka infrastructure costs. Let’s see how this ends up being the case.

Table of contents

  1. The base workloads for consideration

  2. Compute

  3. Storage

  4. A quick (but important!) note on utilization

  5. Networking: The big one

  6. So how can I save money?

  7. Appendix – Cost calculations

The base workloads for consideration

The above chart shows the breakdown of infrastructure costs for both a smaller (20/60 MBps, ingress/egress) and a larger workload (100/300 MBps, ingress/egress) for comparison purposes. It assumes the workload is running on AWS across three availability zones, and data is retained for seven days. We’re setting aside utilization—but more on that later. It also ignores discounting, but keep in mind that discounts typically scale as you spend more with the cloud provider—they do for Confluent Cloud.

Before we get into the details of the analysis, we should clarify upfront that we are omitting dozens of infrastructure components for the sake of simplicity. This includes things like load balancers, NAT gateways, Kubernetes clusters, monitoring solutions, and so on. However, you need these items for a production-ready Kafka environment, and they will only add to your infrastructure-related bills. From our experience, they can increase the infrastructure costs of self-managing Kafka by up to 25%. This analysis also calculates the costs of only a single cluster, but you’ll probably have several across environments.

Therefore, this analysis is likely underestimating what you may end up spending to run and support Kafka on your own.

Note: For those interested in doing their own cost analysis, the formulas behind our calculations are listed below in the Appendix.

Compute

Let’s start with compute, which is usually the first place people look for savings despite only representing a small portion of infrastructure costs. This instinct stems from the pre-cloud world where compute infrastructure in general was hard to scale and was often coupled with storage.

To estimate compute costs for a Kafka cluster, you take the number of nodes and multiply by the cost of running the machine types for each of those nodes (as shown below for our two sample workloads; see formula 1 in the Appendix).

20 MBps Ingress Example Workload

| Kafka Component | Nodes | Machine type | Hourly rate | Monthly cost |
|---|---|---|---|---|
| Brokers | 6 | m5.xlarge | $0.192 | $829 |
| Quorum Controllers | 3 | m5.large | $0.096 | $207 |
| TOTAL | | | | $1,037 |

100 MBps Ingress Example Workload

| Kafka Component | Nodes | Machine type | Hourly rate | Monthly cost |
|---|---|---|---|---|
| Brokers | 15 | m5.xlarge | $0.192 | $2,074 |
| Quorum Controllers | 3 | m5.large | $0.096 | $207 |
| TOTAL | | | | $2,281 |

Simple enough, right? Of course, the real world is much more complicated. Finding the right number of brokers and optimizing your machine type for each workload can be quite challenging—we’ve chosen sample node counts and machine types here for illustrative purposes only.

We’re also ignoring a few Kafka components that are mission-critical for most use cases, such as Kafka Connect and Kafka Streams, which have their own infrastructure costs.
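As a minimal sketch of the compute estimate above, the following reproduces the two table totals. The function name `monthly_compute_cost` is ours for illustration; the node counts, hourly rates, and the 720-hour month come from the tables and the Appendix.

```python
HOURS_PER_MONTH = 720  # 30-day month, as used in the Appendix


def monthly_compute_cost(nodes: int, hourly_rate: float) -> float:
    """Monthly cost in dollars of `nodes` machines billed at `hourly_rate` per hour."""
    return nodes * hourly_rate * HOURS_PER_MONTH


# 20 MBps ingress example: 6 brokers (m5.xlarge) + 3 quorum controllers (m5.large)
small = monthly_compute_cost(6, 0.192) + monthly_compute_cost(3, 0.096)

# 100 MBps ingress example: 15 brokers (m5.xlarge) + 3 quorum controllers (m5.large)
large = monthly_compute_cost(15, 0.192) + monthly_compute_cost(3, 0.096)

print(f"small: ${small:,.0f}/month, large: ${large:,.0f}/month")
# prints: small: $1,037/month, large: $2,281/month
```

Swap in your own node counts and instance rates to get a first-pass figure. Remember that it excludes Kafka Connect, Kafka Streams, and any overprovisioning.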

Storage

Storage is a bit more involved to properly calculate, because the pricing structure takes into consideration things like IOPS and throughput. In an effort to simplify things, we’re going to ignore those line items and just focus on local EBS storage costs. If you scale your clusters vertically or start to get into serious throughput loads, however, you will need to look carefully at extra IOPS and throughput costs.

To estimate storage costs, you can multiply your ingress rate, replication factor, and retention period to determine how much storage your cluster will need at any given time. You can then multiply that figure by the cost of storage (in this case $0.08 per GB-month for EBS; see formula 2 in the Appendix).

| Ingress (MBps) | Replication factor | Retention (days) | Storage needs (GB) | GB-month rate | Monthly cost |
|---|---|---|---|---|---|
| 20 | 3 | 7 | 36,288 | $0.080 | $2,903 |
| 100 | 3 | 7 | 181,440 | $0.080 | $14,515 |
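The storage arithmetic behind this table can be sketched as follows. The function names are illustrative; the constants (86,400 seconds per day, 0.001 GB per MB, $0.08 per GB-month) are the ones used in the Appendix.

```python
SECONDS_PER_DAY = 86_400
GB_PER_MB = 0.001  # decimal conversion, as in the Appendix


def storage_needed_gb(ingress_mbps: float, retention_days: float,
                      replication_factor: int = 3) -> float:
    """Disk space (GB) needed to retain `retention_days` of replicated data."""
    return (ingress_mbps * SECONDS_PER_DAY * retention_days
            * GB_PER_MB * replication_factor)


def monthly_storage_cost(ingress_mbps: float, retention_days: float,
                         replication_factor: int = 3,
                         gb_month_rate: float = 0.08) -> float:
    """Monthly EBS cost in dollars, ignoring IOPS and throughput charges."""
    return storage_needed_gb(ingress_mbps, retention_days,
                             replication_factor) * gb_month_rate


for ingress in (20, 100):
    gb = storage_needed_gb(ingress, retention_days=7)
    cost = monthly_storage_cost(ingress, retention_days=7)
    print(f"{ingress} MBps: {gb:,.0f} GB, ${cost:,.0f}/month")
```

Because every term is multiplied together, doubling the retention period or the ingress rate doubles the bill.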

Storage can be a significant line item for Kafka, but there’s good news here: Tiered Storage (KIP-405) will become available in open source Apache Kafka soon. Tiered Storage allows you to offload older topic data to cloud-based object storage (e.g., Amazon S3). Not only is this storage layer cheaper, but it also removes the need to pay for partition replication. This can decrease your storage costs by over 90% depending on how storage is distributed between local disks and object storage (see formula 3 in the Appendix). Tiered Storage can also reduce your compute costs for certain high-retention use cases, as you no longer need to increase your broker count to scale storage.
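To see how the split between local disks and object storage drives the savings, here is a rough sketch of the Appendix formula for Tiered Storage. Two inputs are assumptions not stated in this post: the S3 price of $0.023 per GB-month (an assumed list price that you should check against current pricing) and the hotset fractions, which are purely illustrative.

```python
def tiered_storage_cost(ingress_mbps: float, retention_days: float,
                        hotset_fraction: float,
                        replication_factor: int = 3,
                        ebs_rate: float = 0.08,
                        s3_rate: float = 0.023) -> float:  # s3_rate is an assumption
    """Monthly storage cost in dollars when only the hotset stays on local EBS disks."""
    logical_gb = ingress_mbps * 86_400 * retention_days * 0.001
    # Local data is replicated by Kafka, so it is paid for `replication_factor` times.
    local_rate = hotset_fraction * replication_factor * ebs_rate
    # Offloaded data is stored once in object storage.
    remote_rate = (1 - hotset_fraction) * s3_rate
    return logical_gb * (local_rate + remote_rate)


baseline = tiered_storage_cost(20, 7, hotset_fraction=1.0)  # everything on EBS
for hotset in (0.5, 0.1, 0.0):  # illustrative fractions only
    cost = tiered_storage_cost(20, 7, hotset_fraction=hotset)
    print(f"hotset {hotset:.0%}: ${cost:,.0f}/month, "
          f"{1 - cost / baseline:.0%} below all-EBS")
```

With `hotset_fraction=1.0` the function reduces to the non-tiered formula, which is a quick sanity check on the inputs.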

In Confluent Cloud, we have been running our cloud-first storage engine, which leverages tiered storage, for years now. We audit every message for data integrity across the storage tiers. For a deep dive into Confluent’s durability auditing service, which proactively detects data integrity issues on trillions of Kafka messages per day, check out this blog post.

One other thing to note on storage—using instance-based volumes is questionable when running Kafka or a Kafka clone on Kubernetes. We have seen reliability and data loss issues when attempting to use instance-based volumes, so we recommend against it.

A quick (but important!) note on utilization

To this point, we’ve set aside the need to overprovision resources to account for variability in your workload(s). However, to ensure reliability and performance in the real world, compute resources need to be overprovisioned to protect against unexpected throughput surges, while storage resources need to be overprovisioned to avoid running out of disk space.

An unfortunate (but necessary) reality is that most self-managed infrastructure runs at very low utilization (sometimes below 20%), but those resources end up on your cloud bill regardless of their actual usage. That is driven in part by the difficulty of dynamically scaling resources up and down depending on the current workload without jeopardizing throughput and latency performance or risking cluster downtime.
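One simple way to reason about this is to divide the cost of the capacity you actually use by your average utilization. The sketch below assumes cost scales linearly with provisioned capacity, which is a simplification, and the function name and dollar figure are hypothetical.

```python
def provisioned_cost(cost_at_full_utilization: float, utilization: float) -> float:
    """Bill for enough capacity that the real workload uses only `utilization` of it.

    Assumes cost scales linearly with provisioned capacity (a simplification).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return cost_at_full_utilization / utilization


# Hypothetical: a workload that would cost $1,000/month on perfectly sized hardware
for utilization in (1.0, 0.5, 0.2):
    bill = provisioned_cost(1_000, utilization)
    print(f"{utilization:.0%} utilized: ${bill:,.0f}/month")
```

Under this assumption, running at 20% utilization means paying roughly five times what the workload strictly requires.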

Optimizing compute and storage utilization is crucial to managing your infrastructure costs. A lot of the engineering work we’ve done at Confluent is focused on helping our users improve their infrastructure utilization rate, such as:

  • Serverless clusters that let you autoscale

  • Tiered Storage to elastically scale storage and compute resources independently

  • Automated partition balancing to optimize broker performance and utilization

  • Client quotas for you to support multiple tenants/applications

  • Resource limits to ensure high availability and scalability

We’ll dive into these items more later on in our blog series.

Networking: The big one

Okay, here’s the big one. Kafka is a high-throughput, low-latency system, so it shouldn’t be that surprising when you discover networking is the largest line item on the bill. The problem is that it’s a hard line to tease apart. Your cloud provider doesn’t say, “Here are your networking costs for Kafka.” No, the networking costs are all mixed up with all the other networking usage in your organization.

So what do we do? Well, let’s model it out. 

Kafka can have several networking costs, but the one that really sticks out is cross-AZ traffic. Let’s assume we are running a multi-zone (three zones) cluster for availability and resiliency. Running Kafka in a single zone is not advisable—zonal outages inevitably happen, and downtime puts your business (and your nights and weekends) at risk.

There are three drivers of cross-AZ traffic:

  1. Producers – a well-balanced cluster will place partition leaders across three zones, meaning that Kafka producers will write to a leader in another zone roughly two-thirds of the time

  2. Consumers – the same concept holds for consumers; a well-balanced cluster will have Kafka consumers read from a partition leader in another zone roughly two-thirds of the time

  3. Partition replication – assuming a replication factor of three (the default recommendation), leader partitions will need to replicate messages to follower partitions in two separate zones

In other words, for each byte of Kafka ingress into a single zone, there are multiple copies of that byte being sent to the other zones. We’ve modeled out how much cross-AZ traffic results from our workloads below (see formula 4 in the Appendix), which is then multiplied by the standard cross-AZ charge of two cents per GB (see formula 5 in the Appendix).

| Kafka Ingress (MBps) | Kafka Egress (MBps) | Cross-AZ traffic (MBps) | Cross-AZ rate (per GB) | Monthly cost |
|---|---|---|---|---|
| 20 | 60 | 93.3 | $0.02 | $4,838 |
| 100 | 300 | 466.7 | $0.02 | $24,192 |
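The three cross-AZ drivers and the resulting monthly charge can be sketched like this. It follows the Appendix formulas for a three-zone cluster with a replication factor of three; the function names are ours.

```python
SECONDS_PER_MONTH = 2_592_000  # 30 days
GB_PER_MB = 0.001


def cross_az_mbps(ingress_mbps: float, egress_mbps: float) -> float:
    """Cross-AZ throughput for 3 zones and replication factor 3."""
    producer = ingress_mbps * 2 / 3   # leader is in another zone ~2/3 of the time
    consumer = egress_mbps * 2 / 3    # same reasoning for reads from the leader
    replication = ingress_mbps * 2    # each byte is copied to two follower zones
    return producer + consumer + replication


def monthly_networking_cost(ingress_mbps: float, egress_mbps: float,
                            rate_per_gb: float = 0.02) -> float:
    """Monthly cross-AZ transfer cost in dollars."""
    gb_per_month = (cross_az_mbps(ingress_mbps, egress_mbps)
                    * SECONDS_PER_MONTH * GB_PER_MB)
    return gb_per_month * rate_per_gb


for ingress, egress in ((20, 60), (100, 300)):
    mbps = cross_az_mbps(ingress, egress)
    cost = monthly_networking_cost(ingress, egress)
    print(f"{ingress}/{egress} MBps: {mbps:.1f} MBps cross-AZ, ${cost:,.0f}/month")
```

Note that consumer traffic scales with fanout: every additional consumer group reading the same data adds to the egress term.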

As your throughput grows and fanout increases, you can see how networking comes to dominate your infrastructure costs. Moreover, once Tiered Storage (KIP-405) is available to reduce your storage costs, networking alone can comprise ~90% (!) of your infrastructure costs. That’s why we’ve focused so much on reducing networking costs at Confluent with our disaggregated networking and storage tiers and cloud provider partnerships.

High networking costs are also why it’s so important to enable follower fetching for consumers (KIP-392). You don’t incur any additional network costs if you read from a follower within the same zone rather than a leader in another zone. You only pay the cross-zone replication cost once, regardless of how many consumers read the data, resulting in significant network cost savings.
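To get a feel for the size of that saving, the sketch below reruns the cross-AZ model with the consumer term set to zero. It assumes every consumer can reach an in-sync replica in its own zone. In self-managed Apache Kafka, KIP-392 is driven by the broker settings `broker.rack` and `replica.selector.class` (set to `org.apache.kafka.common.replica.RackAwareReplicaSelector`) and the consumer setting `client.rack`. The Python names below are illustrative only.

```python
SECONDS_PER_MONTH = 2_592_000
GB_PER_MB = 0.001
RATE_PER_GB = 0.02  # cross-AZ charge used in this post


def monthly_cross_az_cost(ingress_mbps: float, egress_mbps: float,
                          follower_fetching: bool = False) -> float:
    """Monthly cross-AZ cost in dollars for 3 zones and replication factor 3."""
    producer = ingress_mbps * 2 / 3
    # Assumption: with follower fetching, all consumer reads stay in-zone.
    consumer = 0.0 if follower_fetching else egress_mbps * 2 / 3
    replication = ingress_mbps * 2
    total_mbps = producer + consumer + replication
    return total_mbps * SECONDS_PER_MONTH * GB_PER_MB * RATE_PER_GB


for ingress, egress in ((20, 60), (100, 300)):
    before = monthly_cross_az_cost(ingress, egress)
    after = monthly_cross_az_cost(ingress, egress, follower_fetching=True)
    print(f"{ingress}/{egress} MBps: ${before:,.0f} -> ${after:,.0f} "
          f"(saves ${before - after:,.0f}/month)")
```

The saving grows with fanout, since the replication cost is paid once no matter how many consumers read the data.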

So how can I save money?

Once you have a breakdown of your Kafka costs, attack the biggest line items first. Step one is likely to enable follower fetching (currently in early access in Confluent Cloud) to reduce those cross-AZ charges. You can also reduce the replication factor on non-critical topics to optimize networking costs related to partition replication. Next, test different instance types and pick the one that most efficiently fits your needs. Last, scale your cluster up and down based on workload patterns so you aren’t paying for underutilized infrastructure.
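A small sketch of that prioritization, using the monthly figures from the 20 MBps example earlier in this post. The helper name `rank_line_items` is hypothetical.

```python
def rank_line_items(costs: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (name, monthly cost, share of total) sorted from largest to smallest."""
    total = sum(costs.values())
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    return [(name, cost, cost / total) for name, cost in ranked]


# Monthly costs from the 20/60 MBps example workload tables
small_workload = {"compute": 1_037, "storage": 2_903, "networking": 4_838}

for name, cost, share in rank_line_items(small_workload):
    print(f"{name:<11} ${cost:>6,}  {share:.0%}")
```

For this workload, networking comes out on top at a little over half of the total, consistent with the claim at the start of this post.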

Having worked with many customers to compare the costs of self-supporting open source Kafka with both our self-managed software (Confluent Platform) and fully managed cloud service (Confluent Cloud), we’ve found that the most economical path, outside of exceptional cases, is to leverage a fully managed cloud service. Confluent Cloud has a unique, cloud-native architecture that drives down costs, along with the economies of scale that come from operating at scale. In fact, we’ve discovered that many of our customers can cover the cost of Confluent Cloud on infrastructure savings alone, especially as they scale.

We’ll cover our architecture in greater detail in part three of this blog series, including some of the engineering investments we’ve made to reduce the infrastructure costs associated with running Kafka:

  • Optimizing our cloud networking stack and simplifying follower fetching to reduce networking costs

  • Optimizing storage placement between local disks and object storage to reduce storage costs

  • Optimizing client request processing to reduce compute costs

  • Optimizing infrastructure utilization with elastic scaling, client quotas, resource limits, and more

  • Negotiating hefty scaled discounts with cloud providers and passing those savings on to our customers

Be sure to check in next week for the second installment of this series that will focus on the other key cost driver for Kafka—development and operations personnel (i.e., engineering time and resources). If you’d like to estimate your Kafka costs or explore how much you can save with Confluent, we encourage you to check out our cost calculator as a first step today!

Appendix – Cost calculations

  1. Monthly compute cost = Number of nodes * Hourly cost of machine type * 720 hours / month

  2. Monthly storage cost (without Tiered Storage) = Kafka MB ingress / sec * 86,400 secs / day * Retention days * 0.001 GB / MB * Replication factor * Cost of GB / month on EBS

  3. Monthly storage cost (with Tiered Storage) = Kafka MB ingress / sec * 86,400 secs / day * Retention days * 0.001 GB / MB * (% hotset data stored locally * Replication factor * Cost of GB / month on EBS + (1 - % hotset data stored locally) * Cost of GB / month on S3)

  4. Cross-AZ throughput (MB / sec) = Producer cross-AZ throughput + Consumer cross-AZ throughput + Partition replication cross-AZ throughput = (Kafka MB ingress / sec * ⅔) + (Kafka MB egress / sec * ⅔) + (Kafka MB ingress / sec * 2) 

  5. Monthly networking cost = Cross-AZ throughput MB / sec * 2,592,000 secs / month * 0.001 GB / MB * Cost of GB cross-AZ traffic

  • Addison Huddy is a Senior Director of Product Management at Confluent where he leads the Kafka Product group. Prior to Confluent, he held product roles at Pivotal and was an Apache committer. He started his career building data systems at Visa. Addison is a graduate of UCLA and Georgia Tech. Most mornings, you will find him riding on Zwift.
