Apache Kafka® has become the de facto standard for streaming data, helping companies deliver exceptional customer experiences, automate operations, and become software.
As companies increase their use of real-time data, we have seen Kafka clusters proliferate within many enterprises. Often, siloed application and infrastructure teams set up and manage new clusters to solve new use cases as they arise. In many large, complex enterprises, this organic growth has resulted in bloated technical complexity and cost. There comes a point at which it makes sense to ask:
When should the enterprise pivot from bottom-up, organic growth of siloed data streaming projects to a group-wide platform approach?
What is the optimal architecture, infrastructure, and team structure to support an enterprise-wide data streaming platform?
What’s the most cost-effective approach?
At Confluent, we assist and advise numerous organizations in establishing their own enterprise-wide data streaming platform strategies and operating models. This blog post will guide you toward answering the questions above.
Let’s look at one of our real-world (anonymized) customer stories.
One of Confluent’s global enterprise customers experienced rapid growth—both organic and through M&A. Their fast-paced expansion resulted in over 40 production Kafka clusters, roughly divided up across 15 different teams. At the time of their refactoring, their infrastructure included a combination of:
Open source Kafka clusters
Kafka clusters in the public cloud, including clusters using Confluent Platform
Confluent Cloud—our fully managed service
While rapid growth is certainly a reason to celebrate, this enterprise customer found the situation had led to a proliferation of standards, tools, and processes around their Kafka architecture. Each team worked largely independently of the others, innovating as they saw fit for their use cases at the time. But when it came time to build on their successes, for instance by sharing data across teams and creating new derivative products, they found it extremely difficult.
Why? They lacked a common data streaming strategy and had no common standards, tooling, or processes for working together. At the same time, they had 50 full-time workers (full-time equivalents or FTEs) doing the same sort of Kafka DevOps work embedded across the 15 teams. Lots of repetition but no cohesive integration plan.
They also struggled with running mission-critical workloads on some of their open source Kafka clusters. Root cause analyses of incidents and outages revealed chronic break-fix work caused by issues with monitoring, disks, networking, partition balancing, misconfigurations, and missed upgrades. When one team fixed a specific issue, another team would run into the same (or a very similar) issue a short time later.
Adopting a concise data streaming strategy became the first step in plotting a way forward, including identifying tooling, standards, and areas of common ground where they could reduce their efforts while maintaining their product quality. Next, we worked with them to establish a new platform supporting the enterprise and a dedicated Kafka-as-a-Service (KaaS) team.
Focusing on a common data streaming strategy and reducing customized and siloed work, they reduced the headcount of Kafka DevOps FTEs from 50 people to 15, resulting in a savings of 70% in Kafka engineering costs. More importantly, the freed-up engineers could focus on actual business problems instead of repetitive break-fix work, providing higher value to the business (and a more engaging work environment).
Their Kafka clusters were consolidated into two main offerings: a cloud service and an on-premises service. Selective consolidation of workloads into fewer clusters resulted in a savings of almost 50% in infrastructure costs while delivering higher reliability and performance.
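The headcount figure above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function name is ours, and treating the cost saving as equal to the headcount reduction assumes a roughly uniform cost per FTE, which the customer story does not state.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# FTE figures from the customer story: 50 Kafka DevOps engineers before, 15 after.
# Assumption: cost per FTE is roughly uniform, so headcount tracks engineering cost.
fte_reduction = percent_reduction(50, 15)
print(f"Kafka DevOps headcount reduction: {fte_reduction:.0f}%")
```

The same helper can be applied to your own before and after infrastructure spend to compare against the consolidation savings described above.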
In many ways, determining the optimal Kafka cluster strategy aligns with common architecture design principles. At AWS re:Invent 2023, Peter DeSantis, Senior Vice President of AWS Utility Computing, started his keynote with a slide showing the six most important attributes of a cloud computing service: elasticity, security, performance, cost, availability, and sustainability. These all play a part in determining your Kafka architecture too. Dr. Werner Vogels, VP and CTO of Amazon.com, opened his keynote with the message “Architect with cost in mind”. Cost is a key component, and he argues that cost is a close proxy for sustainability. We’re looking specifically at Kafka, so let’s assume we’re designing according to the sound architectural principles above. What else do we need to consider specifically for Kafka clusters?
A Kafka cluster strategy should align with a data strategy, which should consider higher-level data-centric technical and social concepts, such as:
Data mesh: Domain-driven data ownership, data as a product, a self-serve data platform, and federated computational governance.
Data products deserve a special mention, as they provide a high-quality, ready-to-use set of data that can be used across an organization and easily applied to different business challenges—exactly what the streaming platform supports. McKinsey identified that data products can help deliver use cases as much as 90% faster.
Domain-driven design: The enterprise architecture should mirror the business structure as much as possible. This is useful for identifying boundaries and for defining and building data products.
Together, these data-centric concepts provide a framework for identifying, discussing, and solving the most common data problems you’ll run into. You’ll be able to formalize, as strictly or loosely as you choose, a cognitive and procedural model for defining, building, sharing, evolving, deprecating, and deleting your data. You’ll find that you have far fewer data-related outages and broken pipelines, less reliance on tribal knowledge, and the ability to actually trust and use your data on business problems instead of simply finding and fixing yet another broken data set.
A key component here is “centralization”. A common central platform brings both benefits and tradeoffs. Finding the right balance depends on individual business needs, specific project requirements, technical requirements, legal requirements, and cost and service optimizations.
So, how should you design an enterprise’s Kafka cluster architecture? Unsurprisingly, we can’t offer a universally correct answer, as there is no one-size-fits-all. However, there remain some core considerations that are common to any strategy. Let’s take a look.
Centralization vs decentralization is a key factor in determining your cluster strategy. The following diagram illustrates a sliding scale of options between one extreme and the other.
Multiple Kafka clusters anchor the decentralized end of the spectrum, each serving a specific solution, project, department, geographic location, or business unit. These clusters are siloed. They are independently owned and operated, typically with minimal or no collaboration.
Taking a step to the right towards centralization brings us to some consolidation of Kafka clusters. Many clusters remain fully independent, but some are consolidated together to reduce DevOps and infrastructure costs.
The next step in centralization is an enterprise-wide cluster + consolidated clusters. Some independent clusters remain, but there is also a central cluster to store commonly used data. Data may also be replicated into and from the central cluster, acting as a hub for data connectivity. Confluent recommends well-established patterns, such as stretch clusters and connected clusters, for replicating Kafka topics into more than one data center or availability zone.
Finally, at the right of the diagram is full centralization: a single enterprise-wide cluster. All data streaming workloads are combined to realize the benefits of a company-wide service.
Many organizations start at the left of the spectrum. As data streaming gains wider adoption, organizations tend to scale by standing up new clusters. But there are three strategic reasons to want to move to the right of the spectrum and consolidate use cases on existing clusters or design a new central platform approach entirely:
There are also disadvantages to fully centralizing a service:
A centralized service must meet the superset of all its constituents’ requirements, even those that are challenging and expensive to meet. A client that requires extremely low latency compared to its peers can impose unnecessary and expensive restrictions. It may be better to serve special requirements through decentralized purpose-built clusters.
Conventional enterprise service management migrations are challenging and can fail for many reasons. Establishing an enterprise service requires significant organizational and process transformation, achievable goals, and project scoping.
Establishing a centralized service requires finding consensus with stakeholders. Aside from supporting technical requirements, you’ll also need to account for usage-based billing, prioritizing feature requests, providing on-call support, and communicating best practices.
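To make the first disadvantage concrete, here is a minimal Python sketch of how a shared cluster inherits the strictest requirement of any tenant. The tenant names, the two requirement fields, the numbers, and the `factor` threshold are all hypothetical and chosen only for illustration; a real assessment would cover many more dimensions.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TenantRequirements:
    name: str
    max_p99_latency_ms: float    # lower is stricter
    min_availability_pct: float  # higher is stricter


def shared_cluster_requirements(tenants):
    """A shared cluster must satisfy the strictest value of each requirement."""
    return TenantRequirements(
        name="shared-cluster",
        max_p99_latency_ms=min(t.max_p99_latency_ms for t in tenants),
        min_availability_pct=max(t.min_availability_pct for t in tenants),
    )


def dedicated_cluster_candidates(tenants, factor=5.0):
    """Tenants whose latency target is at least `factor` times stricter than the median."""
    typical = median(t.max_p99_latency_ms for t in tenants)
    return [t.name for t in tenants if t.max_p99_latency_ms * factor <= typical]


# Hypothetical tenants, for illustration only.
tenants = [
    TenantRequirements("orders", 200, 99.9),
    TenantRequirements("analytics", 1000, 99.5),
    TenantRequirements("trading", 10, 99.99),
]

shared = shared_cluster_requirements(tenants)
outliers = dedicated_cluster_candidates(tenants)
print(shared)
print(outliers)
```

In this made-up example, a single low-latency tenant sets the latency target for everyone, which is the situation where a purpose-built cluster for that tenant may be the cheaper option.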
Teams often start with a decentralized approach simply because it’s the easiest path. They can get started quickly and aren’t accountable to anyone else. That said, as the complexity of the decentralized approach grows, there comes an inflection point at which increasing costs and reduced quality of service start to make consolidation a much more appealing choice.
When deciding the degree of centralization best for you, it’s important to consider costs (and value), business factors, and technical factors. Let’s take a look at these.
Cost is one of the top concerns when investing in any technology. It’s easy to assess your costs when using a single-tenant cluster. Your bill is simply the sum of the resources you use. It can become more challenging when you move into a more centralized multi-tenant cluster, where you start sharing resources with other teams. When sharing resources, you need to ensure fair allocation of costs across teams, account for project return on investment (ROI), and plan for cluster scaling based on user growth and upcoming features.
But beyond costs, you must also consider value, as minimizing measurable costs should not be your number one goal. The cheapest product or service option isn’t necessarily the best choice, especially if it lacks the features and capabilities you need for your business operations. Some businesses select the product with the lowest sticker price, ignoring the additional in-house work and opportunity costs required to integrate it into their solutions. Others may choose the most featureful (and expensive) option but barely utilize it, much like buying a sports car to drive in bumper-to-bumper traffic. In both cases, you’re getting poor value for your dollar. To help you find the optimum balance of cost and value, we have a few recommendations for you:
Estimate total cost of ownership (TCO): TCO is composed of tangible costs like infrastructure and hardware costs, licenses, subscriptions, usage-based service charges, and development and operations costs. Hidden costs and risk costs, which often constitute a large portion of TCO, are much less tangible. These include toil and break-fix work, customer management during service degradation, on-call outage resourcing, and security breaches. Fully managed services like Confluent Cloud can mitigate a huge portion of these risks and costs (as we outline in this blog post), letting you focus instead on getting work done.
Determine enterprise budget: Multi-tenant services require a well-defined funding plan. Don’t underestimate the ‘lift’ required to allocate this budget, especially if it comes from the decentralized project pools. Executive sponsorship and an internal sales campaign task force are often necessary to renegotiate the existing budgets.
Determine division of costs: Establish clear and well-documented rules around how costs are accounted for and charged back to the users of the centralized service. One common strategy is to adopt a chargeback mechanism to bill individual business units directly. Usage metrics, including throughput, scaling, network traffic, disk usage, and replication factors, should be part of your accounting.
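One way to picture a chargeback mechanism is a weighted, usage-proportional split of the shared cluster’s bill. The Python sketch below is only an illustration under stated assumptions: the team names, metric names, weights, and dollar amounts are hypothetical, and a real model would likely include more of the metrics listed above (network traffic, replication factor, and so on).

```python
def allocate_costs(monthly_cost, usage, weights):
    """Split `monthly_cost` across teams by their weighted share of each usage metric.

    usage:   {team: {metric: amount}}
    weights: {metric: relative importance}
    Metrics with zero total usage are ignored and the remaining weights are rescaled.
    """
    totals = {m: sum(team_usage[m] for team_usage in usage.values()) for m in weights}
    active = {m: w for m, w in weights.items() if totals[m] > 0}
    weight_sum = sum(active.values())
    if weight_sum == 0:
        raise ValueError("no usage recorded for any weighted metric")

    charges = {}
    for team, team_usage in usage.items():
        share = sum(w * team_usage[m] / totals[m] for m, w in active.items()) / weight_sum
        charges[team] = monthly_cost * share
    return charges


# Hypothetical monthly figures, for illustration only.
usage = {
    "payments": {"throughput_mb": 600, "storage_gb": 200},
    "analytics": {"throughput_mb": 400, "storage_gb": 800},
}
weights = {"throughput_mb": 0.5, "storage_gb": 0.5}

charges = allocate_costs(10_000, usage, weights)
for team, amount in charges.items():
    print(f"{team}: ${amount:,.2f}")
```

Because each team’s share is a weighted average of its per-metric shares, the charges always add up to the full bill. The choice of weights is a policy decision that should be documented alongside the chargeback rules.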
The following business considerations are key when determining which Kafka clusters to consolidate and which to leave separate: