Apache Kafka® has become the de facto standard for streaming data, helping companies deliver exceptional customer experiences, automate operations, and become software.
As companies increase their use of real-time data, we have seen the proliferation of Kafka clusters within many enterprises. Often, siloed application and infrastructure teams set up and manage new clusters to solve new use cases as they arise. In many large, complex enterprises, this organic growth has resulted in bloated tech complexity and cost. At some point, it’s worth asking:
When should the enterprise pivot from bottom-up, organic growth of siloed data streaming projects to a group-wide platform approach?
What is the optimal architecture, infrastructure, and team structure to support an enterprise-wide data streaming platform?
What’s the most cost-effective approach?
At Confluent, we assist and advise numerous organizations in establishing their own enterprise-wide data streaming platform strategies and operating models. This blog post will guide you toward answering the questions above.
Let’s look at one of our real-world (anonymized) customer stories.
One of Confluent’s global enterprise customers experienced rapid growth—both organic and through M&A. Their fast-paced expansion resulted in over 40 production Kafka clusters, roughly divided up across 15 different teams. At the time of their refactoring, their infrastructure included a combination of:
Open source Kafka clusters
Kafka clusters in the public cloud, including clusters using Confluent Platform
Confluent Cloud—our fully managed service
While rapid growth is certainly a reason to celebrate, this enterprise customer found the situation had led to a proliferation of standards, tools, and processes around their Kafka architecture. Each team worked largely independently of one another, innovating what they saw fit for their use cases at the time. But when it came time to start building on top of their successes, for instance, by sharing data across teams and creating new derivative products, they found it extremely difficult.
Why? They lacked a common data streaming strategy and had no common standards, tooling, or processes for working together. At the same time, they had 50 full-time workers (full-time equivalents or FTEs) doing the same sort of Kafka DevOps work embedded across the 15 teams. Lots of repetition but no cohesive integration plan.
They also struggled with running mission-critical workloads on some of their Open Source Kafka clusters. Root cause analyses of incidents and outages revealed chronic break-fix work, due to issues with monitoring, disks, networking, partition balancing, misconfigurations, and lack of upgrades. When one team fixed a specific issue, another team would run into the same (or very similar) issue a short time later.
Adopting a concise data streaming strategy became the first step in plotting a way forward, including identifying tooling, standards, and areas of common ground where they could reduce their efforts while maintaining their product quality. Next, we worked with them to establish a new platform supporting the enterprise and a dedicated Kafka-as-a-Service (KaaS) team.
Focusing on a common data streaming strategy and reducing customized and siloed work, they reduced the headcount of Kafka DevOps FTEs from 50 people to 15, resulting in a savings of 70% in Kafka engineering costs. More importantly, the freed-up engineers could focus on actual business problems instead of repetitive break-fix work, providing higher value to the business (and a more engaging work environment).
Their Kafka clusters were consolidated into two main offerings—both cloud and on-premise services. Selective consolidation of workloads into fewer clusters resulted in a savings of almost 50% in infrastructure costs while delivering higher reliability and performance.
In many ways, determining the optimal Kafka cluster strategy aligns with common architecture design principles. At AWS re:Invent 2023, Peter DeSantis, Senior Vice President of AWS Utility Computing, started his keynote with a slide showing the six most important attributes of a cloud computing service: elasticity, security, performance, cost, availability, and sustainability. These all play a part in determining your Kafka architecture too. Dr. Werner Vogels, VP and CTO of Amazon.com, opened his keynote with the message “Architect with cost in mind”. Cost is a key component, and he argues that cost is a close proxy for sustainability. We’re looking specifically at Kafka, so let’s assume we’re designing according to the sound architectural principles above. What else do we need to consider specifically for Kafka clusters?
A Kafka cluster strategy should align with a data strategy, which should consider higher-level data-centric technical and social concepts, such as:
Data mesh: Domain-driven data ownership, data as a product, a self-serve data platform, and federated computational governance.
Data products deserve a special mention, as they provide a high-quality, ready-to-use set of data that can be used across an organization and easily applied to different business challenges—exactly what the streaming platform supports. McKinsey identified that data products can help deliver use cases as much as 90% faster.
Domain-driven design: The enterprise architecture should mirror the business structure as much as possible. Useful for identifying boundaries, defining, and building data products.
Together, these data-centric concepts provide a framework for identifying, discussing, and solving the most common data problems you’ll run into. You’ll be able to formalize, as strictly or loosely as you choose, a cognitive and procedural model for defining, building, sharing, evolving, deprecating, and deleting your data. You’ll find that you have far fewer data-related outages and broken pipelines, less reliance on tribal knowledge, and the ability to actually trust and use your data on business problems instead of simply finding and fixing yet another broken data set.
A key component here is “centralization”. A common central platform brings both benefits and tradeoffs. Finding the right balance depends on individual business needs, specific project requirements, technical requirements, legal requirements, and cost and service optimizations.
So, how should you design an enterprise’s Kafka cluster architecture? Unsurprisingly, we can’t offer a universally correct answer, as there is no one-size-fits-all. However, there remain some core considerations that are common to any strategy. Let’s take a look.
Centralization vs decentralization is a key factor in determining your cluster strategy. The following diagram illustrates a sliding scale of options between one extreme and the other.
Multiple Kafka clusters anchor the decentralized end of the spectrum, each serving a specific solution, project, department, geographic location, or business unit. These clusters are siloed. They are independently owned and operated, typically with minimal or no collaboration.
Taking a step to the right towards centralization brings us to some consolidation of Kafka clusters. Many clusters remain fully independent, but some are consolidated together to reduce DevOps and infrastructure costs.
The next step in centralization is an enterprise-wide cluster + consolidated clusters. Some independent clusters remain, but there is also a central cluster to store commonly used data. Data may also be replicated into and from the central cluster, acting as a hub for data connectivity. Confluent recommends well-established patterns, such as stretch clusters and connected clusters, for replicating Kafka topics into more than one data center or availability zone.
Finally, at the right of the diagram is full centralization: a single enterprise-wide cluster. All data streaming workloads are combined to realize the benefits of a company-wide service.
Many organizations start at the left of the spectrum. As data streaming gains wider adoption, organizations tend to scale by standing up new clusters. But there are three strategic reasons to want to move to the right of the spectrum and consolidate use cases on existing clusters or design a new central platform approach entirely:
There are also disadvantages to fully centralizing a service:
A centralized service must meet the superset of all its constituents’ requirements, even those that are challenging and expensive to meet. A client that requires extremely low latency compared to its peers can impose unnecessary and expensive restrictions. It may be better to serve special requirements through decentralized purpose-built clusters.
Conventional enterprise service management migrations are challenging and can fail for many reasons. Establishing an enterprise service requires significant organizational and process transformation, achievable goals, and project scoping.
Establishing a centralized service requires finding consensus with stakeholders. Aside from supporting technical requirements, you’ll also need to account for usage-based billing, prioritizing feature requests, providing on-call support, and communicating best practices.
Teams often start with a decentralized approach simply because it’s the easiest path. They can get started quickly and aren’t accountable to anyone else. That said, as the complexity of the decentralized approach grows, there comes an inflection point at which increasing costs and reduced quality of service start to make consolidation a much more appealing choice.
When deciding the degree of centralization best for you, it’s important to consider costs (and value), business factors, and technical factors. Let’s take a look at these.
Cost is one of the top concerns when investing in any technology. It’s easy to assess your costs when using a single-tenant cluster. Your bill is simply the sum of the resources you use. It can become more challenging when you move into a more centralized multi-tenant cluster, where you start sharing resources with other teams. When sharing resources, you need to ensure fair allocation of costs across teams, account for project return on investment (ROI), and plan for cluster scaling based on user growth and upcoming features.
But beyond costs, you must also consider value, as minimizing the measurable costs should not be your number one goal. The cheapest product or service option isn’t necessarily the best choice, especially if it lacks the features and capabilities you need for your business operations. Some businesses select the product with the lowest sticker price, ignoring the additional in-house work and opportunity costs required to integrate it into their solutions. Others may choose the most featureful (and expensive) option but barely utilize it, much like buying a sports car to drive in bumper-to-bumper traffic. In both cases, you’re getting poor value for your dollar. To help you find the optimum balance of cost and value, we have a few recommendations for you:
Estimate total cost of ownership (TCO): TCO is composed of tangible costs like infrastructure and hardware costs, licenses, subscriptions, usage-based service charges, and development and operations costs. Hidden costs and risk costs, which often constitute a large portion of TCO, are much less tangible. These include toil and break-fix work, customer management during service degradation, on-call outage resourcing, and security breaches. Fully managed services like Confluent Cloud can mitigate a huge portion of these risks and costs (as we outline in this blog post), letting you focus instead on getting work done.
Determine enterprise budget: Multi-tenant services require a well-defined funding plan. Don’t underestimate the ‘lift’ required to allocate this budget, especially if it comes from the decentralized project pools. Executive sponsorship and an internal sales campaign task force are often necessary to renegotiate the existing budgets.
Determine division of costs: Establish clear and well-documented rules around how costs are accounted for and charged back to the users of the centralized service. One common strategy is to adopt a chargeback mechanism to bill individual business units directly. Usage metrics, including throughput, scaling, network traffic, disk usage, and replication factors, should be part of your accounting.
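To make the chargeback idea concrete, here is a minimal sketch in Python of one possible allocation rule: split the shared cluster’s monthly bill in proportion to each team’s weighted share of throughput and storage. The `TeamUsage` fields, the team names, the 60/40 weighting, and all figures are hypothetical and only illustrate the mechanics; a real scheme would draw on whichever usage metrics your platform reports and whatever rules your stakeholders agree on.

```python
from dataclasses import dataclass


@dataclass
class TeamUsage:
    """Monthly usage for one tenant of the shared cluster (hypothetical fields)."""
    team: str
    throughput_gb: float  # data produced plus consumed over the month
    storage_gb: float     # retained data, including replicas


def allocate_costs(total_cost, usages, throughput_weight=0.6):
    """Split total_cost across teams by weighted share of throughput and storage.

    Assumes at least one team has non-zero throughput and non-zero storage.
    """
    storage_weight = 1.0 - throughput_weight
    total_throughput = sum(u.throughput_gb for u in usages)
    total_storage = sum(u.storage_gb for u in usages)
    bills = {}
    for u in usages:
        share = (throughput_weight * u.throughput_gb / total_throughput
                 + storage_weight * u.storage_gb / total_storage)
        bills[u.team] = round(total_cost * share, 2)
    return bills


# Illustrative numbers only: a 10,000 monthly bill shared by three teams.
usages = [
    TeamUsage("payments", throughput_gb=600, storage_gb=200),
    TeamUsage("logistics", throughput_gb=300, storage_gb=600),
    TeamUsage("analytics", throughput_gb=100, storage_gb=200),
]
bills = allocate_costs(10_000, usages)
for team, amount in bills.items():
    print(f"{team}: {amount:.2f}")
```

A proportional rule like this is easy to explain and audit, which matters when budgets are being renegotiated. The weighting is a policy decision rather than a technical one, so it is worth documenting alongside the chargeback rules themselves.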
The following business considerations are key when determining which Kafka clusters to consolidate and which to leave separate:
While Kafka has become the standard for data streaming, its high volume and low latency come with some technical complexity and costs. We recommend considering:
Maintenance and operations: It tends to be simpler to set up and manage one cluster than many individual clusters. Monitoring, authentication, authorization, scaling, capacity planning, upgrades, and problem-solving are all common requirements for running one’s own cluster. However, a large centralized cluster can often take longer to upgrade, as there are more dependencies and more clients. Some clients may themselves need to move to a newer version, a dependency that is often discovered only once the cluster upgrade is underway.
Resource pooling: A big advantage of centralization is resource pooling. Instead of having reserved overhead on each cluster, you can pool the overhead together in the centralized cluster, often reducing the cost of standby resources. The services running on the centralized cluster can rely on this common large pool of resources, without paying for the extra capacity on their own.
Performance and service level agreements (SLAs): Not all data streaming use cases have the same performance, availability, and redundancy business requirements. Some data streaming services may call for high availability through multi-availability zone clusters or cross-cluster replication. Other services may be able to tolerate intermittent outages without meaningfully degrading performance. Some services and workloads are more critical than others, and many organizations rely on a tiered service approach: Tier 1 services get the highest guarantees, while those in Tier 4 get best-effort support. Sometimes, it can be easier to meet your SLAs by isolating a workload in its own cluster. Noisy neighbor applications can have a significant effect on the performance and throughput of co-tenanted applications. You may choose to use operational decoupling to prevent traffic from one workload from interfering with another. For example, Netflix uses separate clusters for producers and consumers to isolate resources for different use cases. The producer-oriented clusters get messages from all applications while consumer-facing clusters contain only a subset of the data needed for stream processing. Check out Bilgin Ibryam’s Netflix blog post for more information.
Technical redundancy and disaster recovery requirements: Aside from geographical and regional separation, you should consider separating clusters by workload criticality. Think of the blast radius due to a given cluster failure. A single cluster powering all business processes would have a huge blast radius, whereas a dedicated cluster per workload would have a small blast radius. However, it’s not just a boolean on/off. The outage duration is also very important, as a large Tier 1 cluster may only be unavailable momentarily due to the resourcing dedicated to keeping it up and well. Conversely, many small lower-tier clusters (not everything can be Tier 1!) may be down for much longer, greatly exacerbating the pain of the outage. It’s a good idea to selectively separate some workloads, but make sure you’re not simply trading one form of risk for another that is just as bad.
Data gravity: Data can be difficult to move around. Generally, data engineers process and store data near its sources and sinks. Copying and transporting large volumes of data reliably and securely across large distances can be challenging and expensive.
Security: Users and business units should be able to authenticate and see only the datasets they are entitled to see. For example, with Confluent’s Role-Based Access Control (RBAC), users cannot see, modify, publish, read, or delete any datasets without authorization. With multiple clusters, each team can have administration rights only to their Kafka instance. Depending on your security requirements, you may use over-the-wire Transport Layer Security (TLS) encryption, encryption at rest, and field-level encryption.
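Of the technical factors above, resource pooling is the easiest to put numbers on. The Python sketch below compares the capacity needed when each team sizes its own cluster for its own peak against a shared cluster sized for the combined peak. The team names, load figures, and 30% headroom are hypothetical, and the model deliberately ignores per-cluster minimums, replication overhead, and noisy-neighbor effects, so treat it as a back-of-the-envelope illustration rather than a sizing method.

```python
def siloed_capacity(loads_by_team, headroom=0.3):
    """Capacity needed when every team sizes its own cluster for its own peak."""
    return sum(max(series) * (1 + headroom) for series in loads_by_team.values())


def pooled_capacity(loads_by_team, headroom=0.3):
    """Capacity needed when one shared cluster is sized for the combined peak."""
    # Add up all teams' load for each time slot, then size for the busiest slot.
    combined = [sum(values) for values in zip(*loads_by_team.values())]
    return max(combined) * (1 + headroom)


# Hypothetical throughput in MB/s for three teams over four sample hours.
loads = {
    "payments": [40, 90, 50, 30],
    "logistics": [80, 30, 40, 20],
    "analytics": [10, 20, 30, 100],
}
siloed = siloed_capacity(loads)
pooled = pooled_capacity(loads)
print(f"siloed: {siloed:.0f} MB/s, pooled: {pooled:.0f} MB/s")
```

With these sample figures the siloed approach provisions about 351 MB/s while the pooled cluster needs about 195 MB/s, because the three teams peak at different hours. If every team peaked at the same time, the two numbers would be identical, so the benefit depends on how uncorrelated your workloads really are.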
Most enterprises adopt Kafka at a project level and unintentionally end up with multiple independent clusters across the organization. Consolidating clusters via centralization can make sense if your business is looking to:
Improve quality of service and offer data products to reduce the time-to-market (TTM) of new applications. Our findings suggest that investing in data products pays off by reducing TTM by up to 90%.
Standardize ways of working with data through consolidation and simplification. Process optimizations can reduce the total cost of ownership of data streaming by up to 30%.
Mitigate the risk of outages, performance degradation, and security breaches while reducing the data governance burden.
Carefully consider the tradeoffs that refactoring your cluster strategy entails. Keeping clusters separate and closer to the BUs/application teams in one organization may increase autonomy and facilitate innovation. In another, it will drive costs higher and create unnecessary complexity. We recommend you weigh all factors carefully before deciding your targeted level of consolidation.
Running multi-tenant distributed systems is challenging, which is one reason many people like cloud products. You can buy a fully running service with SLAs without having the pain and expenses of building and maintaining it. Confluent Cloud removes the operational drivers for maintaining separate Kafka clusters. Even when business requirements like data residency demand separate clusters, Confluent Cloud minimizes the management burden.
With Confluent Cloud, your team doesn’t need separate clusters to isolate noisy neighbors, enforce different security standards, or provide different SLAs. Confluent Cloud applies best practices everywhere for reliable, cost-effective data streaming. It aligns with the six most important attributes of a cloud computing service highlighted earlier: elasticity, security, performance, cost, availability, and sustainability.
Confluent offers a variety of cluster types available from multiple cloud providers in dozens of regions. If a cloud service isn't a fit for your team's needs, we also have Confluent Platform, which allows you to manage your own Kafka clusters on-premise.
Speak to us if you want help to determine your optimal enterprise data streaming platform strategy. We can help you model the TCO scenarios of data streaming, consider business and technical factors, and provide actionable recommendations. Please get in touch with us at email@example.com.