
Why Hosted Apache Kafka® Leaves You Holding the Bag


Many teams begin their data streaming journey through their cloud provider, drawn to the simplicity of the one‑click “Create Kafka Cluster” button in their cloud console. It’s fast, feels integrated, and promises to “just work”—abstracting away all the operational tasks that only get more complicated in the cloud.

But under the hood, what’s often marketed as “managed” Kafka is really just open source “provisioned” Kafka, wrapped in a thin layer of automation. Your team is still on the hook for the hardest parts: tuning partitions and leadership, rebalancing under load, scaling brokers and storage, patching outdated versions, and firefighting failures when brokers degrade or disks fill up.

This blog will explore three major challenges that come with hosted Kafka—and how a fully managed Kafka service on Confluent Cloud addresses them in a fundamentally better way. 

  • Lack of Cost-Efficient Scaling. Eliminating overprovisioning and hidden costs through innovations like diskless storage and elastic compute.

  • Keeping Brokers Alive. Detecting and healing brownouts, recovering from disk and zonal failures, and avoiding manual failovers entirely.

  • Maintaining Apache Kafka. Isolating tenants, avoiding noisy neighbors, and staying current with Kafka and hardware advancements—without downtime or migration.

Kafka Cost Savings Webinar

For a live discussion and demo on managed vs. hosted Kafka, join us for our webinar, “‘Managed Kafka’ is Costing You: How to Save 70%” on July 30th.

Register Now

How Do Managed vs. Hosted Apache Kafka® Services Impact Your Bottom Line?

When a “managed” service isn’t really managed, the operational burden still falls on you. We’ve seen this time and time again with customers who start with a provisioned offering, only to realize they’re still holding the lion’s share of the operational responsibility.

SecurityScorecard experienced this firsthand. Running hosted Kafka internally led to constant scaling challenges, uptime issues, and on-call fire drills—pulling valuable engineering time away from product innovation.

The Impact of Hosted Kafka

“MSK was not meeting our operational needs as doing something like a version upgrade was hard and very manual. We spent a lot of time trying to figure out cluster size and had some challenges with figuring out the number of brokers to set up.”


– Brandon Brown, Sr. Software Engineer, SecurityScorecard

Fully managed Kafka isn’t just about hosting it on some VMs in the cloud—it should be something fundamentally different. It should take the operational burden off your plate entirely. Think elastic scaling, zero-downtime upgrades, intelligent rebalancing, and proactive incident resolution. No more worrying about brokers, partitions, disk capacity, or version drift. Your team builds; the platform runs. That’s what fully managed Kafka should really mean. And this is the promise of Confluent Cloud.

Confluent Cloud, powered by the Kora engine, redefines what it means to run Kafka in the cloud—it goes beyond basic automation and instead provides a fully managed cloud experience, eliminating the operational burden of Kafka, and doing so at lower cost than hosted alternatives. 

Confluent Cloud Savings

By migrating to Confluent Cloud, SecurityScorecard offloaded their Kafka operations, improved uptime, and saved over $1 million in infrastructure and operational costs.


Read the full story

Under the hood, Kora is more than just automation; it’s a cloud-native Kafka engine purpose-built to serve the Kafka protocol as a resilient, elastic, and efficient service at global scale. Today, it powers over 30,000 clusters across all major clouds, processing 3 trillion+ messages per day—with 99.99% SLAs and hands-free operations for customers. And because Kora’s architecture is so efficient, we can pass those gains directly back to you, delivering better performance and lower total cost of ownership compared to hosted Kafka.

Kora fundamentally re-architects Kafka for the cloud with key innovations. These aren’t just convenience features—they’re deep architectural changes that solve the hardest parts of running Kafka at scale and make it more efficient to operate than hosted options:

  • Elastic Scaling: Elastic CKUs (eCKUs) and diskless clusters (Freight) allow compute to expand or contract on demand, eliminating overprovisioning and cutting costs for unpredictable workloads.

  • Diskless Kafka: Stateless broker nodes write data directly to remote object storage such as S3 instead of persisting data locally on disk or using local block storage as the primary persistence layer. This simplifies scaling (since compute and storage can scale independently), and eliminates expensive cross-AZ replication traffic (particularly in public clouds).

  • Cost-Effective Networking: Private Networking Interfaces (PNI) drop Elastic Network Interfaces (ENIs) directly into your VPC, giving you the same cost-effective networking you use with your other AWS-native services.

  • Self-Healing: Continuously monitors brokers and infrastructure, proactively replaces degraded nodes, and rebalances partitions and leadership automatically to maintain availability.

Over 6,000 customers worldwide, including Notion, L’Oreal, Michelin, Nuuly, RBC, Meesho, Sainsbury’s, and many others, representing a wide variety of use cases and workload sizes, have turned to Confluent Cloud to run Kafka at scale and achieve anywhere from 40-70% savings on Kafka costs.

Now, let’s explore three of the most common—and costly—challenges customers face with hosted Kafka and why a truly managed approach makes all the difference.

Challenge 1: Can You Scale Apache Kafka® Without Breaking the Bank?

With hosted Kafka offerings, customers are forced to statically allocate broker compute, storage, and block storage throughput based on worst-case peak loads. This typically means over-provisioning by 50% or more beyond peak, as well as paying for idle resources during off-peak periods.

And here’s the kicker: in most cases you cannot scale down after scaling up, leaving excess resources stranded and further amplifying the cost penalty. You can deploy scaling automation tools like Cruise Control to help manage partition rebalancing and load distribution—but in hosted Kafka environments, you're responsible for operating, tuning, and troubleshooting them yourself, introducing yet another potential point of failure and operational burden.

Figure 1: Self-managed and hosted Kafka solutions force a choice between overprovisioning and outages. For this workload, the predictable peak is ~650 MBps, the low is ~15 MBps, and the unpredictable peak due to unforeseen demand is 1,000 MBps. The resulting average over the time period is ~300 MBps.
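
To put numbers on this trade-off, here is a back-of-the-envelope sketch in Python using the Figure 1 workload. The unit price is a made-up placeholder, not a real rate; the interesting output is the ratio between what you provision and what you actually consume.

```python
# Back-of-the-envelope comparison using the Figure 1 numbers.
# The per-MBps-hour rate is a hypothetical placeholder; the point
# is the ratio between provisioned and consumed capacity.

UNPREDICTABLE_PEAK_MBPS = 1_000   # unforeseen demand spike (Figure 1)
AVERAGE_MBPS = 300                # average over the period (Figure 1)
HEADROOM = 1.5                    # ~50% over peak, per the pattern above

HOURS_PER_MONTH = 730
RATE_PER_MBPS_HOUR = 0.01         # hypothetical unit price

# Static provisioning: pay for worst-case peak plus headroom, 24/7.
static_capacity = UNPREDICTABLE_PEAK_MBPS * HEADROOM
static_cost = static_capacity * HOURS_PER_MONTH * RATE_PER_MBPS_HOUR

# Usage-based: pay roughly for the average actually consumed.
elastic_cost = AVERAGE_MBPS * HOURS_PER_MONTH * RATE_PER_MBPS_HOUR

print(f"static:  {static_cost:,.0f} units/month ({static_capacity:.0f} MBps provisioned)")
print(f"elastic: {elastic_cost:,.0f} units/month ({AVERAGE_MBPS} MBps average)")
print(f"idle:    {1 - elastic_cost / static_cost:.0%} of static spend buys unused headroom")
```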

Networking is a primary cost driver for Kafka deployments, particularly cross-Availability Zone (cross-AZ) traffic, which is essential for multi-zone cluster high availability. These network egress charges are often obscured within broader cloud billing, distorting the perceived cost-effectiveness of the Kafka service.

Without explicit cost accounting, these charges can escalate to approximately 80% to 90% of total infrastructure costs once storage is optimized, often increasing unnoticed as throughput and fan-out scale. Furthermore, users face an architectural dilemma: choosing between network configurations that prioritize cost efficiency (e.g., VPC peering) for high-throughput streaming workloads or those that offer enhanced security (e.g., PrivateLink). Each choice presents a considerable trade-off between security, manageability, and financial implications.
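
For a sense of scale, the sketch below roughs out the monthly cross-AZ bill for a three-zone, replication-factor-3 cluster. The inter-AZ rate, the even client spread, and the fan-out are all illustrative assumptions, not any provider's published pricing.

```python
# Rough monthly cross-AZ transfer estimate for a 3-AZ, RF=3 cluster.
# Assumptions (illustrative only): ~$0.02/GB combined inter-AZ rate,
# clients spread evenly across zones, fetch-from-follower disabled.

INGRESS_MBPS = 300             # average produce throughput (Figure 1)
FANOUT = 3                     # each byte is consumed three times
INTER_AZ_USD_PER_GB = 0.02     # assumed combined (in + out) rate
SECONDS_PER_MONTH = 730 * 3600

gb_in = INGRESS_MBPS * SECONDS_PER_MONTH / 1024

# Replication: with RF=3 across 3 AZs, two follower copies of every
# byte cross a zone boundary.
replication_gb = gb_in * 2

# Produce: ~2/3 of producers write to a leader in another zone.
produce_gb = gb_in * (2 / 3)

# Consume: ~2/3 of every consumed copy is read across a zone boundary.
consume_gb = gb_in * FANOUT * (2 / 3)

total_usd = (replication_gb + produce_gb + consume_gb) * INTER_AZ_USD_PER_GB
print(f"~${total_usd:,.0f}/month in cross-AZ transfer alone")
```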

How Does Confluent Cloud Deliver Fully Managed, Cost-Efficient Scaling?

In contrast, Confluent Cloud achieves true elastic, cost-efficient scaling (to 20+ GBps) through deep architectural innovations: compute-storage decoupling, an intelligent management plane, eCKUs, and PNI.

Decoupling storage and compute layers has allowed us to implement a two-tiered storage system that keeps hot data on performant local disks while offloading aging segments to low-cost object stores. This not only permits dynamic resizing of broker and disk resources to match active datasets but also minimizes data movement and resource drain during cluster expansion or contraction.
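
As a mental model only (this is not Kora's actual implementation), the tiering decision looks roughly like the sketch below: the newest segments win a fixed local budget, and anything older or over budget becomes a candidate for object storage.

```python
# Illustrative two-tier retention planner: keep recent ("hot") segments
# on local disk, offload the rest to object storage. The window and
# budget values are assumptions, not Confluent's real parameters.

import time
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    last_modified: float  # epoch seconds

HOTSET_SECONDS = 6 * 3600      # assumed local retention window
HOTSET_BYTES = 50 * 1024**3    # assumed local size cap per partition

def plan_tiering(segments: list[Segment]) -> tuple[list[Segment], list[Segment]]:
    """Split segments into (local, offload) by age and local-size budget."""
    now = time.time()
    local, offload, budget = [], [], HOTSET_BYTES
    # Newest segments first, so the freshest data wins the local budget.
    for seg in sorted(segments, key=lambda s: s.base_offset, reverse=True):
        too_old = now - seg.last_modified > HOTSET_SECONDS
        if too_old or seg.size_bytes > budget:
            offload.append(seg)  # candidate for object storage
        else:
            local.append(seg)
            budget -= seg.size_bytes
    return local, offload
```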

Elasticity is orchestrated by an intelligent control plane that provisions, bins, and rebalances microservices (brokers, controllers, proxies) based on real-time telemetry, and leverages queuing-theory models for precise load prediction and autoscaling. Self-Balancing Cluster algorithms continuously redistribute partitions and leadership in response to utilization spikes, while tenancy is modularized into “cells” for granular scaling and resource isolation. We’ve repeatedly seen that customers running their workloads on autoscaling clusters save 50% or more on infrastructure costs alone.
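
To show the flavor of that approach (and only the flavor; this is not Confluent's actual model), here is a toy sizing function that keeps per-broker utilization below a target so queueing delay stays bounded:

```python
# Toy utilization-targeted autoscaler with an M/M/1-style delay check.
# Per-broker capacity, target utilization, and base service time are
# all assumed values for illustration.

import math

PER_BROKER_MBPS = 100      # assumed sustainable throughput per broker
TARGET_UTILIZATION = 0.6   # headroom keeps latency flat under spikes
BASE_SERVICE_MS = 10       # assumed per-request service time

def brokers_needed(observed_mbps: float) -> int:
    return max(1, math.ceil(observed_mbps / (PER_BROKER_MBPS * TARGET_UTILIZATION)))

def mm1_delay_ms(observed_mbps: float, brokers: int) -> float:
    """M/M/1 queueing delay blows up as per-broker utilization nears 1."""
    rho = observed_mbps / brokers / PER_BROKER_MBPS
    return float("inf") if rho >= 1 else (rho / (1 - rho)) * BASE_SERVICE_MS

for load in (150, 400, 650, 1_000):  # MBps, echoing the Figure 1 range
    n = brokers_needed(load)
    print(f"{load:>5} MBps -> {n:>2} brokers, ~{mm1_delay_ms(load, n):.1f} ms queueing")
```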

Critically, Confluent Cloud also exposes a high-level abstraction: the eCKU, a bundle of capacity across multiple dimensions that supports your workloads, including throughput, partitions, and client connections. This enables true pay-as-you-go billing, where users are charged only for actual resource consumption, not for the underutilized headroom dictated by legacy static provisioning—thereby eliminating persistent cost inefficiencies endemic to traditional managed Kafka architectures.
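
Conceptually, the metering works like the sketch below. The per-eCKU limits shown are placeholders rather than Confluent's published numbers; the point is that billed capacity tracks whichever dimension you actually push hardest, and falls when the workload quiets down.

```python
# Sketch of multi-dimensional, pay-as-you-go capacity metering.
# The per-eCKU limits are hypothetical placeholders.

import math

ECKU_LIMITS = {
    "ingress_mbps": 60,
    "egress_mbps": 180,
    "partitions": 3_000,
    "connections": 9_000,
}

def eckus_used(usage: dict[str, float]) -> int:
    """Billable eCKUs = the most-utilized dimension, rounded up."""
    return max(1, math.ceil(max(usage[d] / ECKU_LIMITS[d] for d in usage)))

# A quiet period bills a fraction of what the daily peak does.
print(eckus_used({"ingress_mbps": 15, "egress_mbps": 45,
                  "partitions": 2_000, "connections": 1_500}))  # -> 1
print(eckus_used({"ingress_mbps": 650, "egress_mbps": 1_950,
                  "partitions": 2_000, "connections": 1_500}))  # -> 11
```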

Another key investment we’ve made is PNI, an AWS-native networking option that mirrors PrivateLink’s security and connectivity benefits, but within Confluent Cloud. Previously, users often faced a dilemma: choose a network configuration that offers low cost (VPC peering) for high-throughput streaming workloads, or one that provides high security (e.g., PrivateLink). Each option had its own pros and cons.

PNI provides the best of both options, delivering secure and cost-effective data transfer. It’s considerably cheaper than PrivateLink because it keeps traffic within your AWS private network, avoiding egress charges and simplifying complex network configurations. Look for a more detailed technical blog post on PNI in the near future.

Challenge 2: Who’s Actually Keeping the Brokers Alive With a Hosted Apache Kafka® Service?

Hosted Kafka solutions provide only a thin automation layer atop instances and block storage, delegating resilience responsibilities—including broker health monitoring, storage management, replication, and failover—directly to the customer. For example, when a broker exhausts its attached block storage, that broker becomes unavailable and its partition replicas go offline with it, degrading availability and risking a cascading cluster outage.

Simply expanding storage does not resolve the outage. The broker remains unavailable until Kafka's retention policy triggers and old data is deleted, freeing up space—a process that can take hours or days, during which time impacted partitions are inaccessible and client operations may fail.
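
In practice, the firefight usually starts with an emergency retention cut so the broker can eventually reclaim space. Here is a hypothetical sketch of that manual step using the confluent-kafka Python client; the topic name and new retention value are made up:

```python
# Emergency retention cut on a disk-full hosted cluster: the kind of
# manual intervention that is entirely your job on hosted Kafka.
# Topic name and retention value are hypothetical.

from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})

# Drop retention on the heaviest topic to one hour so deletion can free
# disk. Caution: alter_configs replaces the topic's whole dynamic config
# set, so other overrides revert to defaults -- one more sharp edge of
# operating this by hand.
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "orders",  # hypothetical topic
    set_config={"retention.ms": str(60 * 60 * 1000)},
)

for res, future in admin.alter_configs([resource]).items():
    future.result()  # raises on failure
    print(f"retention lowered on {res}")

# Even now, the broker stays impaired until deletion actually runs and
# space is reclaimed -- which can take hours, as noted above.
```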

Moreover, hosted offerings do not include native self-healing or telemetry-driven detection. Instead, customers must manually identify and replace failed or degraded brokers and attempt re-replication under pressure, often with a heightened risk of data loss or extended downtime.

How Does Confluent Cloud Manage Broker Health and Prevent Outages?

Confluent Cloud delivers resilience as a native architectural outcome—not an operational afterthought. By decoupling compute from storage and centralizing data retention across the cluster, it ensures durability and high availability at scale. Hot data is served from fast, local disks, while older segments are offloaded to durable, cost-efficient object storage—eliminating the risk of broker outages due to local disk exhaustion.

Unlike hosted Kafka offerings, which rely on basic host-level health checks, Confluent Cloud continuously monitors infrastructure at the Kafka layer for brownouts or degradation. When issues arise, brokers are automatically replaced, and their partitions and leadership reassigned using real-time telemetry, and partitions are rebalanced across the cluster—all without manual intervention. Thanks to integrated tiered storage, only active, non-offloaded data needs to move—accelerating recovery from days to seconds.
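
As a simplified illustration of the idea (not Kora's actual telemetry pipeline), a brownout detector can compare each broker's tail latency against the rest of the fleet:

```python
# Illustrative brownout detector: flag brokers whose p99 produce latency
# drifts far from the fleet median. Thresholds are assumed values.

from statistics import median

def degraded_brokers(p99_ms_by_broker: dict[int, float],
                     ratio: float = 3.0, floor_ms: float = 50.0) -> list[int]:
    """Flag brokers whose p99 exceeds both 3x the fleet median and an
    absolute floor -- limping, not hard-failed, so host checks miss it."""
    fleet_median = median(p99_ms_by_broker.values())
    return [b for b, p99 in p99_ms_by_broker.items()
            if p99 > max(ratio * fleet_median, floor_ms)]

# Broker 3 is browning out: slow disk or NIC, not a clean crash.
print(degraded_brokers({1: 12.0, 2: 15.0, 3: 220.0, 4: 11.0}))  # -> [3]
```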

These self-healing workflows are fully automated and coordinated by Confluent’s control plane, ensuring uninterrupted availability without operator intervention.

Figure 2: Confluent Cloud offers industry-leading SLAs for organizations running mission-critical Kafka workloads.

Confluent Cloud also includes multi-zone capability to withstand availability zone failures across all major public cloud providers. To minimize downtime and data loss in a regional outage, we introduced Cluster Linking and Schema Linking to help organizations architect a multi-region disaster recovery plan for Kafka.

These built-in features enable Confluent customers to keep data and metadata in sync across 65+ regions within all three major cloud providers, improving the resiliency of in-cloud deployments. Users can create active-passive or active-active disaster recovery patterns to achieve a low RPO (recovery point objective) and RTO (recovery time objective).

Beyond these capabilities, we’ve implemented robust, battle-tested operational processes that include unit and integration tests, soak tests, scale tests, and failure injection tests to emulate faults and verify that each release does not regress in performance or reliability.

The result: high availability and fault tolerance aren’t layered on later—they’re built into the platform from the ground up.

Challenge 3: Who’s Responsible for Maintenance Tasks With Hosted Apache Kafka®?

Hosted Kafka offerings place the burden of configuration, tuning, and upgrades squarely on the customer—especially in shared, multi-tenant environments where Kafka operators are responsible for multiple applications hosted on the same cluster. They provide no built-in workload isolation, so traffic spikes or connection storms from one application can destabilize the entire cluster.

Customers are responsible for managing partition layouts, quotas, and broker settings. Kafka version upgrades and instance type changes lag behind upstream releases and require manual intervention, often involving downtime-prone migration strategies to adopt critical patches or performance improvements. Cluster upgrades, including testing and rolling brokers, are entirely owned by the customer.
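
To make that concrete: on hosted Kafka, containing a noisy neighbor means reaching for Kafka's own quota tooling yourself. Below is a hedged sketch of that manual step, with a hypothetical client ID and limits:

```python
# Throttling a noisy client by hand with Kafka's standard quota tooling.
# The client ID and byte-rate limits are hypothetical.

import subprocess

subprocess.run([
    "kafka-configs.sh",
    "--bootstrap-server", "broker-1:9092",
    "--alter",
    "--entity-type", "clients",
    "--entity-name", "analytics-app",  # the noisy client.id
    # Cap produce at 1 MB/s and fetch at 2 MB/s for this client.
    "--add-config", "producer_byte_rate=1048576,consumer_byte_rate=2097152",
], check=True)

# You still have to notice the storm, identify the client, pick safe
# limits, and remember to lift the throttle later -- isolation is not
# something the hosted service does for you.
```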

How Does Confluent Cloud Automate Apache Kafka® Orchestration and Upgrades?

Confluent Cloud addresses these challenges through automated orchestration and safe, zero-downtime upgrade workflows. Update windows are set for each cluster based on workload patterns and region/timezone, controlling when a Kafka roll can begin and minimizing the risk of customer impact during upgrades. Clients are isolated via per-client quotas and observability metrics.

Confluent also delivers new Kafka versions and hardware improvements within weeks of release, with upgrade orchestration handled automatically by the platform—no customer coordination or downtime windows required. This removes the operational complexity of managing Kafka at scale and allows teams to stay current and secure with minimal effort.

Cloud-Native by Design – With the Cost Savings to Prove It

In this post, we’ve outlined why running Kafka at scale is so difficult—and how hosted solutions often leave teams struggling with the same operational burdens they were hoping to avoid. 

Confluent Cloud’s architectural approach solves three of the most common pain points—enabling true elastic scaling, ensuring resilience without operator intervention, and managing large-scale multi-tenant workloads. These aren’t just surface-level improvements—they’re the result of 5 million engineering hours spent making Confluent Cloud truly cloud-native.

In upcoming posts in this series, we’ll go deeper into how these capabilities are built—unpacking the distributed systems challenges we solved so you don’t have to—and explore the customer use cases enabled by these architectural investments. 

From rethinking our private networking stack to reduce read/write throughput costs for customers, to designing autoscaling mechanisms that avoid overprovisioning without compromising latency, to orchestrating live upgrades without coordination or downtime—we’ll walk through the technical decisions, trade-offs, and architectural innovations that make Confluent Cloud not just easier to operate but also fundamentally more resilient and efficient at scale.

We’re Rewriting the Cost Equation for Kafka in the Cloud. So Let’s Talk.

You might be thinking: “I’d love to offload Kafka operations to Confluent Cloud—but isn’t it more expensive than hosted alternatives?” That may have been true in the past, but not anymore.

With the architectural efficiency of our Kora engine and the scale we’ve achieved across thousands of production clusters, we’re able to deliver significant cost savings—without sacrificing reliability or performance. For virtually any workload, we’re confident that Confluent Cloud can deliver a better price-performance profile than cloud-hosted Kafka offerings.

Want to learn more? Join us on July 30th for a more expansive discussion on these topics and a live demo of how you can migrate from hosted Kafka services to Confluent Cloud. Register for the webinar right away.

Apache®, Apache Kafka®, and Kafka® are registered trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.

  • David Nasi is a Senior Director of Product Management at Confluent, where he leads the Cloud Kafka Product group. Prior to Confluent, he led the product organization for AWS Lambda at Amazon Web Services. He began his career developing application infrastructure software at IBM. David holds degrees from MIT and the Technion – Israel Institute of Technology.

  • Bharath leads the go-to-market product marketing team at Confluent. He has played an integral role in launching several end-to-end campaigns and GTM motions to generate market awareness and drive demand for Confluent’s platform. He holds an MBA from the McCombs School of Business.
