
How Apache Kafka Works: An Introduction to Kafka’s Internals


It’s not difficult to get started with Apache Kafka®. Learning resources can be found all over the internet, especially on the Confluent Developer site. If you are new to Kafka, I’d recommend starting with the Kafka 101 course. But if you’re really serious about Kafka, you might want to delve a bit deeper: how it replicates data between nodes, what happens if replication fails, how consumers scale out automatically without losing their ordering guarantees, and what Kafka’s secret sauce is.

Jun Rao’s Kafka Internals course answers all these questions, and many more, by explaining how Kafka works from the inside out. And who better to do this than one of Kafka’s original co-creators, who has played a pivotal role in the technology from its first few lines of code all the way through to the big-name KIPs of today?

The fundamentals

This course starts out with a review of the fundamentals of Kafka. If you’re already familiar with the basics, you might be tempted to skip this section, but there is something unique about a Kafka overview from one of the original creators of the framework. Jun’s experience and insights make this module much more than a Kafka 101 review.

Try it out for yourself
Learn more about tuning Apache Kafka with this hands-on exercise.

Inside the Apache Kafka broker

When we produce an event to Kafka, it just gets appended to the topic, and when we consume events, we just retrieve the event at the next offset, right? Well, yes, but there’s quite a bit more going on than this description implies. Jun will take us inside the broker to explain the journey taken by each produce or fetch request.

We’ll look at the moving parts that handle the requests in a fast, efficient, and resilient way. We’ll learn about the physical storage of data in Kafka and how that factors into the process.
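The simple mental model of "append on produce, read from the next offset on fetch" can be sketched in a few lines. This is only a toy in-memory stand-in for a single partition, not how the broker is implemented; the `PartitionLog` class and its method names are made up for illustration.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        offset = len(self._records)
        self._records.append(value)
        return offset

    def fetch(self, offset, max_records=10):
        """Return (offset, value) pairs starting at the requested offset."""
        batch = self._records[offset:offset + max_records]
        return [(offset + i, value) for i, value in enumerate(batch)]


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# A consumer remembers the next offset it wants and asks for records from there.
next_offset = 0
first = log.fetch(next_offset, max_records=2)
next_offset = first[-1][0] + 1
second = log.fetch(next_offset, max_records=2)
```

Everything the real broker adds on top of this (request queues, segment files on disk, replication) is what the module goes on to cover.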

This look under the hood of the Kafka broker is more than just interesting. It also gives us insight into some of the configurations that we use every day when working with Apache Kafka.

Data Plane: Replication Protocol

Replication is one of the most important functions of the Kafka broker, and it handles it well. So well, in fact, that we often don’t think about it much, other than setting the replication factor of our topics. But when you consider how much we rely on replication to provide the durability and high availability that we’ve come to expect from Kafka, it probably warrants a deeper understanding of how it works.

In this module, Jun gives us just that, with detailed explanations and illustrated examples. He covers the roles of partition leaders and followers, the in-sync replica list, leader epochs, high watermarks, and more.
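As a rough illustration of how the in-sync replica list and the high watermark relate, the sketch below treats the high watermark as the smallest log end offset among the in-sync replicas, so that consumers only see records every in-sync replica already has. The broker IDs and offsets are invented, and leader epochs are left out entirely.

```python
def high_watermark(log_end_offsets, in_sync_replicas):
    """Smallest log end offset among the in-sync replicas.

    log_end_offsets: broker id -> next offset that replica would write.
    in_sync_replicas: set of broker ids currently in the ISR.
    """
    return min(log_end_offsets[broker] for broker in in_sync_replicas)


# Hypothetical partition: broker 1 is the leader, 2 and 3 are followers.
log_end_offsets = {1: 10, 2: 8, 3: 5}

# While the slow follower (broker 3) is still in the ISR, it holds the
# high watermark back.
hw_all_in_sync = high_watermark(log_end_offsets, {1, 2, 3})

# Once broker 3 is dropped from the ISR, the high watermark can advance.
hw_after_shrink = high_watermark(log_end_offsets, {1, 2})
```

This is why a lagging follower is eventually removed from the in-sync replica list: otherwise it would delay visibility of new records for every consumer.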

The Apache Kafka control plane

We’ve all heard about the ZooKeeper removal that was first announced with KIP-500. In this module, we’ll get a close-up look at ZooKeeper’s replacement, KRaft. We’ll see some of the advantages of KRaft, such as improved scalability and more efficient metadata propagation. Then we’ll go through some step-by-step examples of KRaft metadata replication and reconciliation as well as how the active controller is elected from the available voters.

Consumer group protocol

Consumer groups are the almost magical component that allows us to scale Kafka consumer applications up or down with ease and safety. The technology behind that wizard’s curtain is the consumer group protocol. In this module, Jun gives us a thorough explanation of how the consumer group protocol works. He’ll cover the group coordinator, group membership, partition assignment strategies, and how they affect the rebalancing process.

We’ll also learn about group coordinator failover, group initialization, and partition offset tracking. We’ll even see detailed examples of the different partition assignment strategies in action, including the amazing CooperativeStickyAssignor. When all is said and done, you might still be wondering if it’s really magic after all.
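To make the idea of a partition assignment strategy concrete, here is a simplified sketch of two of the simpler approaches, range-style and round-robin-style assignment, for a single topic. It is not the CooperativeStickyAssignor and it ignores rebalancing entirely; the function names and consumer IDs are assumptions made for the example.

```python
def range_assign(partitions, consumers):
    """Give each consumer a contiguous block of partitions."""
    consumers = sorted(consumers)
    per_consumer, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers each take one additional partition.
        count = per_consumer + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


def round_robin_assign(partitions, consumers):
    """Deal partitions out to consumers one at a time, in turn."""
    consumers = sorted(consumers)
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = list(range(7))
members = ["consumer-a", "consumer-b", "consumer-c"]
range_result = range_assign(partitions, members)
round_robin_result = round_robin_assign(partitions, members)
```

Either way, every partition ends up with exactly one owner in the group, which is what preserves per-partition ordering as the group scales.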

Try it out for yourself
See the power of the consumer group protocol for yourself with this hands-on exercise.

Configuring durability, availability, and ordering guarantees

Apache Kafka is known for its strong durability, availability, and ordering guarantees, but how does it achieve those, and what do we have to do to take advantage of them? In this module, Jun will tell us about some of the key configurations, such as acks and min.insync.replicas, and how they affect durability and availability. We’ll also learn about some of the trade-offs inherent in these configurations.
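The trade-off between acks and min.insync.replicas can be sketched as a small decision function. This is a simplified model, assuming that min.insync.replicas is only enforced for acks=all and that any other acks setting needs just a live leader; the function name is hypothetical.

```python
def produce_accepted(acks, isr_size, min_insync_replicas):
    """Would the partition leader accept a write under these settings?"""
    if acks == "all":
        # With acks=all the write is rejected when too few replicas are in sync.
        return isr_size >= min_insync_replicas
    # With acks=0 or acks=1 only the leader has to be available.
    return isr_size >= 1


# Replication factor 3, min.insync.replicas=2: one broker can fail and
# acks=all writes still succeed, but a second failure blocks them.
one_down = produce_accepted("all", isr_size=2, min_insync_replicas=2)
two_down = produce_accepted("all", isr_size=1, min_insync_replicas=2)

# acks=1 keeps accepting writes, at the cost of weaker durability.
two_down_acks_1 = produce_accepted("1", isr_size=1, min_insync_replicas=2)
```

Raising min.insync.replicas favors durability, while lowering it (or using a weaker acks setting) favors availability.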

We’ll wrap up this segment with a discussion of how the Idempotent Producer, along with message keys, can provide ordering guarantees strong enough to bank on, literally.
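One way to picture the idempotent producer is as a sequence-number check on the partition leader: a retried write that was already appended is recognized and dropped rather than written twice. The sketch below is a deliberate simplification (one record per request, one counter per producer ID), and the class and return values are invented for illustration.

```python
class PartitionLeader:
    """Toy leader that de-duplicates writes by producer sequence number."""

    def __init__(self):
        self.log = []
        self.last_sequence = {}  # producer id -> last sequence appended

    def append(self, producer_id, sequence, value):
        last = self.last_sequence.get(producer_id, -1)
        if sequence <= last:
            return "duplicate"      # a retry of something already written
        if sequence != last + 1:
            return "out_of_order"   # a gap: an earlier write is missing
        self.log.append(value)
        self.last_sequence[producer_id] = sequence
        return "appended"


leader = PartitionLeader()
results = [
    leader.append(7, 0, "debit"),
    leader.append(7, 0, "debit"),   # retry after a lost acknowledgment
    leader.append(7, 2, "refund"),  # arrives before sequence 1
    leader.append(7, 1, "credit"),
]
```

Because duplicates and gaps are both refused, records with the same key land in their partition once and in the order they were sent.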


Transactions

In event streaming systems where multiple events need to be processed to complete a single unit of work, transactions are essential. Kafka’s transactional producer and Kafka Streams’ exactly-once semantics (EOS) are on the job. In this module, we’ll see how failures in a non-transactional stream processing system can lead to corrupted data or worse. Then we’ll see how transactions protect us from such scenarios.

Since a Kafka topic is an immutable append-only log of events, we can’t just roll back, like we might in a relational database. So, we’ll see how Kafka uses a strategy of adding abort or commit markers to the log, along with setting consumer isolation.level=read_committed to give us the protection we need.
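The marker approach can be sketched with a toy log in which each entry is either a data record or a commit/abort marker written on behalf of a producer. The function below is only a model of what a consumer with isolation.level=read_committed ends up seeing, under the assumption that records from aborted transactions are skipped and nothing at or beyond the first still-open transaction is returned. The tuple layout and names are made up for the example.

```python
def read_committed(log):
    """Return the values a read_committed consumer would see, in offset order.

    log: list of (producer_id, kind, value), where kind is
    "data", "commit", or "abort".
    """
    open_txn = {}    # producer id -> offsets written in its open transaction
    aborted = set()  # offsets of records that belong to aborted transactions
    for offset, (producer_id, kind, _) in enumerate(log):
        if kind == "data":
            open_txn.setdefault(producer_id, []).append(offset)
        elif kind == "abort":
            aborted.update(open_txn.pop(producer_id, []))
        elif kind == "commit":
            open_txn.pop(producer_id, None)

    # Stop before the first record of any transaction that is still open.
    open_starts = [offsets[0] for offsets in open_txn.values()]
    last_stable = min(open_starts) if open_starts else len(log)

    return [
        value
        for offset, (_, kind, value) in enumerate(log[:last_stable])
        if kind == "data" and offset not in aborted
    ]


log = [
    (1, "data", "debit A"),
    (2, "data", "debit B"),
    (1, "data", "credit C"),
    (2, "abort", None),     # producer 2's transaction is rolled back
    (1, "commit", None),    # producer 1's transaction is committed
    (3, "data", "debit D"), # producer 3's transaction is still open
]
visible = read_committed(log)
```

Nothing is ever removed from the log here. The aborted record is still stored; the consumer simply filters it out.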

Topic compaction

Topic compaction is an alternative to Kafka’s default time-based retention. With a compacted topic, the goal is to have the latest value for every key. In this module, we’ll learn about the use cases and limitations of compacted topics, and then Jun will take us on a deep dive into the process of topic compaction. We’ll see how the compaction process works at the segment level and when compaction is triggered. We’ll also learn about the special way tombstones and transaction (abort/commit) markers are handled to give our applications time to process these important artifacts.

A compacted topic is guaranteed to have the latest value for a key, but there are times when it will also have older values. After going through this module, you’ll understand why.
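A small sketch hints at where those older values come from. It assumes, as a simplification, that compaction only rewrites closed segments and leaves the active segment alone; tombstones and transaction markers are not modeled, and the record layout is invented.

```python
def compact(records):
    """Keep only the highest-offset record for each key.

    records: list of (offset, key, value) in offset order.
    """
    latest = {key: offset for offset, key, _ in records}
    return [(o, k, v) for o, k, v in records if latest[k] == o]


closed_segments = [
    (0, "alice", "v1"),
    (1, "bob", "v1"),
    (2, "alice", "v2"),
]
active_segment = [(3, "bob", "v2")]

# Only the closed segments are cleaned; the active segment is appended as-is.
log_after_compaction = compact(closed_segments) + active_segment
```

After this pass the log still contains bob's older value at offset 1 alongside the newer one at offset 3, while alice's older value is gone. Offsets are preserved, so the surviving records keep their original positions.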

Tiered storage

While it has always been possible to keep data in Kafka long term, tiered storage makes it much more affordable. By moving older data to an object store, such as Amazon’s S3, tiered storage dramatically reduces the amount of expensive local storage we need on our brokers. Join Jun in this discussion of the benefits of tiered storage and how it works.

While tiered storage will be coming to Apache Kafka via KIP-405, in this module we’ll see how we can put it to use today, with Confluent.

Cluster elasticity

Kafka clusters can scale up or down as needed by adding and removing broker nodes, but along with this ability comes the need to rebalance data across brokers. In this module, we’ll look at some different tools that we can use to keep our clusters in a balanced state. From the kafka-reassign-partitions.sh shell script that comes with Apache Kafka, to the full-featured Self Balancing Clusters from Confluent, we’ll look at how each of these tools works as well as some pros and cons to consider.
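For a feel of what manual rebalancing involves, the sketch below builds a reassignment plan in the JSON shape that kafka-reassign-partitions.sh reads from its reassignment file. The replica placement is a naive rotation chosen purely for illustration, not the placement the tool itself would propose, and the topic name and broker IDs are assumptions.

```python
import json


def build_reassignment(topic, num_partitions, brokers, replication_factor):
    """Spread replicas across brokers by simple rotation (illustrative only)."""
    partitions = []
    for partition in range(num_partitions):
        replicas = [
            brokers[(partition + r) % len(brokers)]
            for r in range(replication_factor)
        ]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"version": 1, "partitions": partitions}


# Hypothetical cluster that has just grown from three brokers to four.
plan = build_reassignment("orders", num_partitions=4, brokers=[1, 2, 3, 4],
                          replication_factor=3)
print(json.dumps(plan, indent=2))
```

Writing and maintaining plans like this by hand is the kind of work that the more automated tools covered in this module aim to take over.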


Geo-replication

To achieve our high-availability and disaster recovery goals, we’ll most often need to operate in more than one data center or cloud region. This brings with it the need for geo-replication. Fortunately, as with cluster data balancing, we are presented with multiple solutions to this challenge. In this module, Jun will walk us through the following tools for operating Kafka clusters in multiple locations:

  • Confluent Multi-Region Clusters
  • Kafka MirrorMaker 2
  • Confluent Replicator
  • Confluent Cluster Linking

We’ll learn how each of these works and what its strengths and weaknesses are. You’ll still need to do some work to determine the best choice for your situation, but isn’t it great to know we have these tools available?

Try it out for yourself
Try out geo-replication with Cluster Linking in this hands-on exercise.

Next steps

If you’re ready to dive even deeper into Kafka’s internals, check out the resources below:

After 30 years as a developer, architect, project manager (since recovered), author, trainer, conference organizer, and homeschooling dad, Dave Klein landed his dream job as a developer advocate with Confluent. After two years with Confluent, Dave joined Tabular, where he helps developers use Apache Iceberg to get more value from their data and have more fun doing it.
