
The (Hidden) Cost of Kafka Operations


In part one of our blog series on understanding your Kafka costs, we covered the infrastructure costs required to run your Apache Kafka® cluster(s), highlighting the outsized (and often overlooked) role networking plays in your cloud bill. Infrastructure isn’t the sole cost of operating Kafka though: you’ll also need engineering talent to configure, manage, and scale your clusters, and you’ll need to develop your own supporting software to implement end-to-end data streaming capabilities, especially as Kafka usage scales and becomes increasingly mission critical.

Kafka has been tapped by companies across every industry to connect all parts of their organization and tech stack in real time, providing a scalable, performant, and durable platform to power event-driven architectures and microservices. However, as with any critical technology, realizing these benefits requires a commensurate investment of engineering time and resources. Naturally, there is an inherent cost that comes with dedicating highly skilled engineers to configuring, deploying, and managing Kafka and other data streaming components. There’s also a corresponding opportunity cost: each hour allocated to cluster upgrades, rebalancing topic partitions, or addressing unplanned downtime is an hour not spent on other engineering projects that are core to your business.

As resources and budgets become increasingly constrained, teams that are smart about leveraging managed services are able to move up the stack and connect those services to their company’s key internal systems, initiatives, and technologies, helping the organization realize their full value. This frees them up to figure out, for example, how to apply data streaming in machine learning apps, rather than being bogged down in yet another Kafka upgrade or struggling to get operational in a new region or environment.

In this second installment, we’ll walk through the major activities that drive development and operations costs, and how you can assess their impact on your Kafka budget. We’ll also briefly touch on the more intangible costs associated with running Kafka (items like downtime and security risk), which are harder to quantify but are very real considerations to bear in mind when deciding whether to self-support Kafka in-house.

Table of contents

  1. Estimating development and operations costs

  2. Capacity planning

  3. Platform development

  4. Cluster scaling

  5. Ongoing maintenance and support

  6. Intangible costs and risks

  7. So how can I save money?

Estimating development and operations costs

The math for calculating your development and operations costs is fairly straightforward. You determine how much time your team will spend on getting your cluster to production and managing it going forward, and then multiply that figure by the cost of that time. Of course, estimating the value of those two variables can be a bit handwavy depending on how meticulous you are at tracking your time and activities—however, as engineering teams are increasingly asked to do more with less, getting a true handle on where your teams are spending their time becomes critical.

The amount of engineering time allocated to running Kafka naturally varies across organizations, workload, and scale. Kafka adoption tends to start via pockets of early experimentation across development teams. Once it gains traction with a few individual projects moving to production, we typically see the equivalent of two engineers responsible for cluster development and operations to ensure a highly available and performant environment. The engineering workload can be shared across several people on a part-time basis rather than individuals solely focused on Kafka, but that is immaterial to our cost calculations because we’re measuring the overall engineering time required.

For our analysis, we’ll use a conservative estimate of two full-time engineers to set a cost figure for a smaller production use case. We’ll also want a more substantial estimate for running Kafka at larger scales. Based on Lyft’s engineering blog post on operating Kafka at global scale, their streaming team consists of approximately seven to ten engineers, so we’ll use the low end of that range, seven, as our larger estimate. We should note this is likely an underestimate of the engineering resources most organizations would need to operate at that scale, as Lyft has poured extensive engineering expertise and investment into their underlying Kafka infrastructure.

The cost of that engineering time also varies across organizations and geographies. Given Kafka’s continued growth and increasing criticality, engineers with Kafka expertise are consistently associated with the highest salaries in tech. Talent.com estimates the salary for a Kafka engineer in the United States to be ~$140K annually, so let’s assume you’re operating within the United States and use that figure for our calculations. The total cost of employment can be quite different from salary, however: once you factor in things like benefits, bonuses, and stock packages, the fully loaded cost of employment is about 1.25 to 1.4 times salary alone. The estimates below use the low end of that range (1.25x).

| | Early Stage Estimate | At Scale Estimate |
|---|---|---|
| Engineers running Kafka | 2 | 7 |
| Annual salary | $140,000 | $140,000 |
| Annual Development & Operations Cost – Salary | $280,000 | $980,000 |
| Annual Development & Operations Cost – Fully loaded | $350,000 | $1,225,000 |
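If you’d like to rerun these estimates with your own headcount or salary figures, the arithmetic can be sketched in a few lines of Python. The function name and the `LOAD_FACTOR` constant are ours, purely for illustration; the 1.25 multiplier is the low end of the fully loaded range mentioned above, and you should substitute whatever figure your finance team uses.

```python
def annual_engineering_cost(engineers, salary, load_factor=1.0):
    """Annual cost of the engineering time spent running Kafka.

    load_factor scales salary up to a fully loaded cost
    (benefits, bonuses, stock); 1.0 means salary only.
    """
    return engineers * salary * load_factor


SALARY = 140_000    # US salary estimate cited in this post
LOAD_FACTOR = 1.25  # assumption: low end of the 1.25-1.4x range

for label, engineers in [("Early stage", 2), ("At scale", 7)]:
    base = annual_engineering_cost(engineers, SALARY)
    loaded = annual_engineering_cost(engineers, SALARY, LOAD_FACTOR)
    print(f"{label}: salary ${base:,.0f}, fully loaded ${loaded:,.0f}")
```

Raising `LOAD_FACTOR` to 1.4 gives the upper bound of the same estimate.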

Engineers who run Kafka are immensely valuable, but their skillsets are almost certainly not limited to Kafka. You take on a significant opportunity cost when orienting your team’s time around building and managing low-level infrastructure rather than more strategic initiatives that directly impact your business. Real-time applications and critical data pipelines are more likely to differentiate your business compared to managing the underlying platforms on which they run.

There are many engineering tasks that come with self-managing Kafka. While Kafka may seem easy enough to get up and running, deploying a production-ready Kafka environment that can support mission-critical use cases becomes more burdensome and resource-intensive as you scale. In the remainder of this blog, we’ll focus on the major tasks these engineering teams are responsible for, which can drive up your overall development and operations personnel costs.

Capacity planning

Determining your throughput, latency, and storage requirements is often the first operational task you’ll face for Kafka. Matching your cluster’s capacity to the fluctuating demand of your workloads is a separate challenge, however, and the process is not as simple as determining a broker count. There are a host of other details you’ll need to consider and test: machine types, disk types, disk sizes, load balancers, broker configurations, and more.

Underprovisioning capacity inevitably leads to some combination of poor latency, insufficient throughput, or even cluster downtime. The downsides of underprovisioning are so severe, particularly for mission-critical pipelines, that overprovisioning often becomes the only acceptable alternative for the business. In our experience serving customers who self-manage Kafka, average CPU utilization is only ~25%. Of course, this just creates a different cost in the form of under-utilized infrastructure, as discussed in our prior blog.

With Confluent’s serverless offerings, users can skip the capacity planning process entirely. There is no upfront sizing, and the clusters can elastically expand and shrink as the workload changes, whether within a single day or over a longer period. This removes the need to overprovision (and overpay) for the infrastructure and resources on which Kafka runs.

Platform development

Kafka is an open source project with many powerful tools, but it isn’t a complete data streaming platform on its own. Connectors, security and governance features, geo-replication tools, and other capabilities need to be integrated from third parties or custom-built in-house. For example, metadata catalogs, role-based access control, and monitoring tools are not provided as part of the core Kafka project, and must be developed and integrated on your own.

Among our self-managed customers operating at larger scales, we see significant adoption of key components beyond the core Kafka broker (shown below). While Kafka may be the start of the data streaming journey, you’ll inevitably need additional features to implement and mature your use cases at enterprise scale.

Adoption of key capabilities among top two quartiles (by scale) of self-managing customers

| Components beyond core Kafka | Usage rate at scale |
|---|---|
| GUI-based Monitoring | >90% |
| Connectors | >90% |
| Schema Registry | >90% |
| Stream Processing | ~80-90% |
| Geo-replication | ~80-90% |
| Advanced Security Controls | ~70-80% |
| Kubernetes Operator | ~50-60% |

The development of these components can be deceptively time-consuming and costly. In fact, we know this firsthand; at Confluent, we have teams of engineers focused on designing, building, and maintaining all of the components around Kafka that our customers depend on. If you’re planning on building your own complete data streaming platform in-house, expect to add several more engineers to your cost equation. Then add the commensurate timeline to account for the development cycles required, along with the inevitable delays that come with ironing out all the corner cases and failure modes.

And the engineering burden for these tasks doesn’t end once you reach production. It’s important to remember that all of these components require ongoing maintenance to prevent outages and performance degradation.

Cluster scaling

Your cluster will inevitably face variations in throughput, whether from seasonal spikes like a Black Friday surge, or more lasting increases like adding a net new data streaming workload. As such, it’s highly likely you’ll need to scale your cluster to meet changing demands. Moreover, if, like many of our customers, you operate your platform with a shared services team, it is almost impossible to accurately predict this demand across multiple teams as Kafka usage scales.

While Kafka is massively scalable, the scaling process itself is certainly not without challenges. You need to configure each additional broker, provision new infrastructure resources, and rebalance partitions to evenly distribute the workload and realize any performance improvements. These steps can be quite manual, meaning they are time-consuming, tedious, and prone to human error—especially if you want to avoid bringing your clusters down.
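To give a flavor of the rebalancing step, here is a minimal Python sketch that builds a round-robin replica assignment after a broker is added. The topic name `orders`, the broker IDs, and the `round_robin_plan` helper are all hypothetical. The output follows the JSON shape of the reassignment file accepted by Kafka’s partition reassignment tooling, but a real plan would also need to account for rack placement, how much data each move copies, and replication throttling, none of which this sketch attempts.

```python
import json


def round_robin_plan(topic, num_partitions, broker_ids, replication_factor):
    """Spread each partition's replicas across brokers in round-robin order."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds broker count")
    partitions = []
    for p in range(num_partitions):
        # Offset the starting broker by the partition number so that
        # preferred leaders (the first replica) rotate across brokers.
        replicas = [
            broker_ids[(p + r) % len(broker_ids)]
            for r in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical cluster: broker 4 was just added to brokers 1-3.
plan = round_robin_plan(
    "orders", num_partitions=6, broker_ids=[1, 2, 3, 4], replication_factor=3
)
print(json.dumps(plan, indent=2))
```

Even in this toy form, generating the plan is the easy part; executing it safely against a live cluster is where most of the engineering time goes.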

Teams from LinkedIn and Lyft have documented much of the pain associated with scaling their Kafka clusters, running into issues around cascading controller failures, slow broker startup and shutdown times, or frequent instance failures. On-call rotations become more burdensome when dealing with these issues, resulting in burnout for your valuable engineers who are crucial to keeping Kafka highly available and performant.

Ongoing maintenance and support

Given that Kafka is tapped to support many mission-critical use cases, you’ll also need to set aside time and resources for the ongoing upkeep of your cluster. Ensuring your cluster is upgraded and patched regularly is crucial to avoiding bugs and the negative impacts they can have (e.g., security incidents, performance degradation, data loss).

You may have had to scramble to deploy a critical fix for something like the Log4j vulnerability, only to realize you had completely underestimated what is involved in upgrading your old, vulnerable Kafka version, and ended up spending the rest of the night working through test scenarios to make sure the upgrade would go well in production. Painful experiences like this are standard when addressing CVEs and getting the call to suddenly conduct widespread patching.

There are also the other inevitable Kafka-related fire drills, such as teams wondering why latency has spiked from 20ms to 100ms for their real-time application. This may require pulling engineers off other projects to troubleshoot and resolve whatever is causing the problem, whether it be unbalanced topic partitions or a jump in producer traffic.

Intangible costs and risks

Intangible costs and risks—things like downtime, security incidents, and delayed time to value—are more difficult to estimate and won’t have their own line item in your Kafka budget. Quantifying these costs is understandably challenging, which is why they are often significantly underestimated or ignored entirely. The unfortunate reality is that incidents can and do occur, and the costs to your business are very real when they do—especially when you have to explain to your customers why one of your services they’re relying on was down or compromised.

It’s understandable to view any precise estimates of these line items with some level of skepticism. However, we still recommend you account for them with some high-level, directional analysis when evaluating your Kafka-related costs. Even a rough estimate can be a useful exercise to help justify both your infrastructure and personnel-related investments.

Downtime risk

We often ask customers what happens when Kafka goes down—whether to their business or to the individual teams responsible for keeping the clusters up and running. Their answers vary depending on the criticality of their use case (“my world stops” is a personal favorite), but the impacts are almost always severe—lost revenue, lost data, customer dissatisfaction, fines, audits, SLA penalties, and so on. Translating these into an hourly cost for your specific workloads is more art than science, but you can multiply whatever cost you decide is most applicable by the number of hours you expect to be offline to estimate your total cost of downtime (see a simple example below).

| | Estimate |
|---|---|
| Hourly cost of downtime | $1,000 |
| Hours of downtime per year (99% Uptime) | 87.6 |
| Annual downtime cost | $87,600 |
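The same estimate can be sketched in Python so you can plug in your own numbers. The function name is ours, and the $1,000 hourly cost is only the placeholder from the example above; the additional uptime levels are included purely to illustrate how quickly the total moves as availability changes.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def annual_downtime_cost(hourly_cost, uptime_pct):
    """Return (expected downtime hours per year, annual downtime cost)."""
    downtime_hours = HOURS_PER_YEAR * (100 - uptime_pct) / 100
    return downtime_hours, downtime_hours * hourly_cost


HOURLY_COST = 1_000  # placeholder from the example; use your own figure

for uptime in (99.0, 99.9, 99.99):
    hours, cost = annual_downtime_cost(HOURLY_COST, uptime)
    print(f"{uptime}% uptime: {hours:.2f} hours down, ${cost:,.0f} per year")
```

The hourly cost is the hard input here; the rest is multiplication.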

Preventing said downtime and ensuring high availability is also not a free activity. Disaster recovery operations are complex with Kafka, requiring activities like failover design and testing. Replicating a Kafka cluster over to a backup location adds complexity to your architecture, creates a greater operational burden, and requires more infrastructure. And when a disaster inevitably occurs, you’ll also need to deal with things like DNS reconfigurations, application failovers, and offset translations, all while your downstream users anxiously wait for a status update.

Security incidents

We don’t need to convince you that security breaches and data leaks are costly to a business, most commonly in the form of lost business due to customer distrust, regulatory fines, and lawsuits. It’s important for you to take the necessary precautions to minimize the risk of an incident, though properly securing your cluster takes time, effort, and testing. 

We recommend you take a “defense in depth” approach—ensure the right ACLs are in place, that sensitive topics and fields are properly encrypted, and that user and client actions are traced for forensics and auditing. While Kafka has many of the features required for securing your clusters, you’re going to need to spend time and money organizing, managing, and maintaining it all. Allocating resources to building a cohesive data streaming platform with the necessary security controls requires even more investment, adding to your total Kafka-related spend.

Delayed time to value

So far, we’ve discussed engineering time as both a direct line item for your Kafka budget and as an opportunity cost: how much more impact could you and your team have working on other projects rather than dealing with all of the management responsibilities that come with running Kafka?

However, there is an additional hidden cost, which comes in the form of delayed time to value—how much does my business stand to lose by not having my data streaming workload up and running today? For certain revenue-generating or cost-saving use cases, each day you’re in production is valuable and adds to the top and/or bottom line.

An unfortunate reality is that infrastructure teams are sometimes perceived as bottlenecks to application developers. Any delays in deploying capacity or reaching production not only exacerbate your Kafka-related costs (and potentially require uncomfortable conversations with management), but also deepen this perception of having the pace of innovation dictated by your infrastructure teams. Teams who have figured out how to effectively leverage managed services are best positioned to free themselves to move up the stack and be part of the solution rather than the perceived problem.

So how can I save money?

Simply put, the best way to reduce these costs is to move to a fully managed Kafka service. By offloading these responsibilities to a provider solely focused on offering a cloud-native data streaming service, you get to spend your time and resources elsewhere while also significantly reducing the risk of downtime and security breaches. Keep in mind that while many vendors position themselves as fully managed, only Confluent Cloud is a truly serverless offering that abstracts away all the complexity we’ve covered.

In our next blog, we’ll dive into the many engineering investments we’ve made to reduce the cost drivers covered so far in our series. For now, we can tease this by sharing that we’ve invested over five million engineering hours to build a unique cloud-native architecture that offers a significantly more efficient and performant version of Kafka to reduce your infrastructure footprint, development and operations burden, and intangible risks (while layering on many features beyond Kafka crucial to implementing data streaming use cases end to end).

In the meantime, if you’d like to see how much you can save by migrating to Confluent, we encourage you to check out our cost calculator and get a customized report on how you can reduce costs for your specific environment and workloads.

  • Nick Bryan is a Director of Product Marketing at Confluent, where he focuses on connectors, stream processing, and governance features. Prior to Confluent, he worked at IBM in its analytics consulting practice. He is roughly 60% water.
