Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now
Today I’d like to give a tour of the internals of Confluent’s Apache Kafka® service. Powering this is a next-generation engine, Kora. Kora is a cloud data service that serves up the Kafka protocol for our thousands of customers and their tens of thousands of clusters. Kora isn’t something you can download, or even something you could run outside our control plane and the rest of our cloud, but understanding how it works gives an idea of some of what goes into our service and some of the advantages it conveys.
It doesn’t in any way displace open-source Kafka for those wanting a system they can manage themselves—and we continue to contribute heavily to Apache Kafka—but it is what allows us to offer a true cloud service at a phenomenal scale. In the rest of this blog post, I’ll describe why we built it, some of the advantages it gives, and a bit about how it works under the covers.
When we launched Confluent Cloud in 2017, we had a grand vision for what it would mean to offer Kafka in the cloud. But despite the work we put into it, our early Kafka offering was far from that—it was just open-source Kafka on a Kubernetes-based control plane with simplistic billing, observability, and operational controls. It was the best Kafka offering of its day, but still far from what we envisioned.
We quickly realized that to build the service we imagined it wouldn’t be enough to just put Kafka on servers in AWS. The systems that have succeeded in fulfilling the promise of the cloud are very different beasts. S3 isn’t anything like running a dedicated NFS server per customer, and Snowflake isn’t just a bunch of Teradata warehouses on EC2. These systems are completely different in how they are architected, how they operate, and the experience they deliver to customers.
Well, any system design ends up being defined by the constraints it has to satisfy, and the challenges facing a cloud data system are quite different from those of a self-managed open-source download.
Apache Kafka is designed to be as easy as possible to set up and run with a variety of DevOps tools, provide a single cluster with high performance in any environment, be upgraded every year or so, serve users within a single company that works together to plan capacity and be operable by generalist operations and engineering personnel.
A cloud system has very different constraints. Some of the challenges are much bigger: it has to serve thousands of customers, be built from the ground up to be multitenant, be largely driven by data-driven software-based operations rather than human operators, provide strong isolation and security across customers with unpredictable and even adversarial workloads, and be something a team of hundreds can innovate in rapidly.
But some things are easier: it doesn’t need to be downloaded and installed so it can have multiple services each with a fair amount of complexity, it only runs on a tightly controlled hardware and software stack, it doesn’t need to be configurable for different environments, and it never needs to be operated by anyone but the team who builds it.
We realized that to make our vision a reality we had to build something designed for these constraints. The result is what we call Kora. It is 100% compatible with all currently supported versions of the Kafka protocol but is designed from the ground up to be a truly managed service.
Kora was engineered to do the following:
Be multi-tenant first, supporting thousands of tenants with strong isolation
Run across more than 85 regions in three clouds
Be operated at scale by a small on-call team (we estimate we are about 1000x more efficient at operations than the average Kafka team)
Disaggregate individual components within the network, compute, metadata, and storage layer
Intelligently manage data locality between memory, SSDs, and object storage to maximize throughput while minimizing cost and latency
Improve performance by deeply optimizing for the cloud environment and unique workloads of a streaming system in the cloud
Capture real-time usage to automatically optimize data placement, fault detection and recovery, and other aspects of operations
Optimize the cost structure of large-scale usage
Kora now powers 30,000+ Confluent Cloud fully managed clusters for thousands of customers, keeping their critical workloads running seamlessly while reducing the time and money spent on operations and infrastructure management.
This isn’t a replacement for open-source Kafka. Apache Kafka is still by far the best platform if you want a self-managed open-source data streaming system, and we continue to contribute very significant features to open-source Kafka. But Apache Kafka isn’t a native cloud data service, and Kora is. So let me tell you all about Kora.
Kora helps us scale our operations, but it also helps you. Here are a few of the benefits it brings:
Elasticity: Enables 30x faster scale up and down
Reliability: More than 10x higher availability when compared to the fault rate of our self-managed customers or other cloud services, allowing us to provide a 99.99% uptime SLA
Performance: Significantly lower latency than self-managed Kafka on similar hardware (see performance results below)
Cost: An optimized cost structure that saves customers money, as discussed in the Confluent Cost Savings Challenge
Compatibility: Kora implements Kafka’s protocol and it is fully compatible
I’ll dive into a few of these benefits and then talk a little bit about how it works under the covers.
We’ll do a deeper dive into performance in a future blog post, but here we’ll share a quick summary of the results. We did a simple small-scale test on identical hardware of Kora and our full cloud platform against open-source Kafka. In many ways this is a lopsided test since Kora has significant additional functions from extra processes that measure performance, check system health and data durability, interface with our control plane, and more.
However, even with this additional load, this test shows significant performance gains. Here we report the P95 end-to-end latency (from producer to consumer) with throughput and partition count varying. You can see that in the most demanding situations, Kora can significantly outperform Apache Kafka.
These results are getting even better. The light blue line represents the next iteration of our replication layer which further improves latency, especially in high partition count cases or under significant load.
Open source software is free. However, one of the surprising advantages of a cloud service is the ability to provide something that is not just better and faster, but also cheaper. The cost of an open-source data system comes from the people doing the operations as well as the infrastructure (servers, networking, observability, storage, etc.) to run it.
By carefully optimizing the use of cloud resources to drive down infrastructure cost, and by relentlessly automating management to reduce operational burdens, we can offer our service at a price that is cheaper than doing it yourself. This isn’t hypothetical, we actively challenge potential customers to do this analysis and even offer to pay a reward if our price fails to beat their costs.
Self-managed data systems often end up frozen in time. A specific version on particular hardware gets deployed and remains untouched except for (we hope) critical security patches or other mandatory maintenance. Kora is continuously updated. Enabling agile iteration at a very large scale without compromising stability is one of the most significant challenges of a cloud system.
Our multi-phase fleet update software allows us to roll out features gradually across our fleet and carefully progress them through performance and soak tests, live canary deployment, and manage risk through deployment across clusters in over 85 cloud regions.
This means our customers are always on the latest and greatest that Confluent has to offer. Security updates are rolled out quickly and our service continuously improves without them having to upgrade or rework their apps.
You can see this in action in the load for a particularly troublesome workload for a customer that had bad skew problems. As we upgraded our data balancing to a more sophisticated balancing algorithm, it was able to correct the skew and improve the latency across the cluster without the customer doing anything on their end.
So how does all this work? There is a lot to it. I won’t go through all the aspects of our cloud, control plane, and associated systems—just a few core elements that make Kora special.
The Kora engine consists of several distinct services. At the highest level, Kora is deployed within independent physical clusters across our compute layer. The unit of provisioning, however, is known as a logical cluster which represents an isolated tenant within that cluster. Logical clusters allow us to provide elastic scalability on top of the pool of compute resources within the cluster.
Multi-tenancy is one of the core aspects of the system. All layers, including networking, request processing, and I/O subsystems, must respect tenant boundaries and capacity allocations to ensure sufficient performance at the tenant level. This underlying infrastructure allows us to implement our core virtual customer-facing capacity abstractions (CKUs and clusters) and the associated limits they enforce.
These limits are not just between tenants. The service itself must react gracefully as it hits its limits. Unlike a self-managed system, it can’t simply be swamped by requests under too heavy a load, it has to proactively provide backpressure to allow capacity expansion and balancing to take effect.
We expose this same quotas and limits system to our customers as well so they can set cluster-wide quotas to ensure hard limits for the tenants sharing within their cluster.
A stateless network routing layer is responsible for interfacing with a wide variety of cloud networking technologies, terminating and rate-limiting connections, and routing traffic within the physical cluster. This layer is stateless and can rapidly autoscale with the load of the underlying cluster.
Kora manages customers across more than 85 cloud regions and ensures the replication of data both within availability zones and between regions. The first layer of replication is system-defined and ensures that replicas of a partition are spread across availability zones and that client traffic reads avoid unnecessary network charges.
Our replication layer allows for linking between regions with asynchronous Cluster Linking, which extends the reach of the replication protocol geographically.
Kora is built to operate large clusters with high numbers of customers and partitions. In this configuration, there is a fundamental trade-off based on how data is placed within the cluster. If a customer’s partitions are spread as broadly as possible (say randomly across machines in a large cluster), then individual writes will be small with little ability to batch network or disk I/O, significantly reducing performance and increasing request processing and replication overhead.
Furthermore, the blast radius of any failure or degraded server will be large, potentially touching at least one partition for a large number of customers. However, if a customer's partitions are too concentrated on a small number of machines (in the limit, just three), then downtime of any one machine, whether due to failure or during software rolls, will potentially remove a significant fraction of capacity for that customer and materially degrade performance.
In order to support predictable performance and failure handling, this trade-off cannot be dictated by cluster size, Kora utilizes a cellular-based architecture where tenants are placed into cells that are evenly distributed across availability zones for high availability. This is done to improve resource utilization while reducing the blast radius during failures and the overall impact of cluster upgrades.
When the demand from existing tenants in the cell grows and reaches the cell’s capacity, one or more tenants are migrated to less-loaded cells.
Given we have no indication of what a new tenant’s workload or utilization will be, Kora’s tenant placement mechanism chooses two available cells at random and assigns the tenant to the cell with the lower load. This mechanism allows us to favor cells with a low load and avoids placing all new tenants in a single cell, which could create hotspots as they scale.
Kora must simultaneously balance the storage of data between tiers in its storage hierarchy (memory, local storage, and object storage), as well as between servers that will service requests for that data. This balancing is powered by feedback loops driven by the real-time usage of the cluster.
Data is actively migrated between memory, SSDs, and object storage. This process keeps recently accessed data in memory or local storage for low latency while ensuring fairness across tenants. Data is tiered out to object storage in chunks from the local representation but retained on SSD as a cache for some period of time. The broker manages all access to tiered data to maximize access and minimize the cost of requests to object storage.
Aggressively tiering data has a direct impact on elasticity since this allows active balancing of partitions between servers with minimal movement of data.
The continual optimization of the placement of partitions across the cluster is particularly important to ensure consistent load and predictable performance. This balancing must simultaneously respect the AZ replication, cell placement rules, and tenant fairness while attempting to balance CPU and local I/O usage and minimizing the cost and impact of frequent partition movement or unstable churn in which partitions move back and forth due to varying load.
One of the core pillars of Kora is to build a system that is operated by software, not humans. There is a direct correlation between operational toil on our engineers, the availability of our service, and the experience our users have.
This type of automation is hundreds of small things rather than one big one. I’ll outline two illustrative categories.
The first is a set of checks on the liveness, performance, and correctness of our systems. For example, we run continuous checks on all data that is stored to prove that all stored data is retained for the required time and that no silent corruption or loss of data has occurred.
Another category is health checks and remediation. We also have thousands of health checks that continuously probe the brokers to detect any lapses in availability or performance. When running tens of thousands of instances, failures in underlying computing, storage, or other cloud infrastructure become commonplace. The system must be able to automatically detect and mitigate these issues before they become an issue for our users. Hard failures are easy to detect, but it is not uncommon to have failure modes that degrade performance or produce regular errors without total failure.
An example of this is the degraded storage remediation Kora has built-in. Kora runs a storage health manager that keeps track of storage operations. If storage operations stop showing progress, the broker is automatically restarted. If the issues persist, partition leadership is migrated away from the degraded broker and it is replaced.
Hopefully, that quick tour gives a sense of the system we’ve built. We’re incredibly proud of the hard work and dedication of the many engineers who’ve poured millions of hours of engineering to build this. For those who want to learn more, we’ll be publishing a paper that dives further into some of the internals.
Why do our customers choose Confluent as their trusted data streaming platform? In this blog, we will explore our platform’s reliability, durability, scalability, and security by presenting some remarkable statistics and providing insights into our engineering capabilities.
Companies are looking to optimize cloud and tech spend, and being incredibly thoughtful about which priorities get assigned precious engineering and operations resources. “Build vs. Buy” is being taken seriously again. And if we’re honest, this probably makes sense. There is a lot to optimize.