It's not difficult to get started with Apache Kafka®. Learning resources can be found all over the internet, especially on the Confluent Developer site. If you are new to Kafka, I'd recommend starting with the Kafka 101 course. But if you're really serious about Kafka, you might want to delve a bit deeper: how does it replicate data between nodes, what happens if replication fails, how do consumers scale out automatically without losing their ordering guarantees, and what is Kafka's secret sauce?
Jun Rao's Kafka Internals course answers all these questions, and many more, by explaining how Kafka works from the inside out. And who better to do this than one of Kafka's original co-creators, a man who has played a pivotal role in the technology from its first few lines of code all the way through to the big-name KIPs of today?
This course starts out with a review of the fundamentals of Kafka. If you're already familiar with the basics, you might be tempted to skip this section, but there is something unique about a Kafka overview from one of the original creators of the framework. Jun's experience and insights make this module much more than a Kafka 101 review.
When we produce an event to Kafka, it just gets appended to the topic, and when we consume events, we just retrieve the event at the next offset, right? Well, yes, but there's quite a bit more going on than this description implies. Jun will take us inside the broker to explain the journey taken by each produce or fetch request.
We'll look at the moving parts that handle the requests in a fast, efficient, and resilient way. We'll learn about the physical storage of data in Kafka and how that factors into the process.
This look under the hood of the Kafka broker is more than just interesting; it also gives us insight into some of the configurations that we use every day when working with Apache Kafka.
Replication is one of the most important functions of the Kafka broker, and it handles it well. So well, in fact, that we often don't think about it much, other than setting the replication factor of our topics. But when you consider how much we rely on replication to provide the durability and high availability that we've come to expect from Kafka, it probably warrants a deeper understanding of how it works.
In this module, Jun gives us just that, with detailed explanations and illustrated examples. He covers the roles of partition leaders and followers, the in-sync replica list, leader epochs, high watermarks, and more.
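To make the high watermark idea a little more concrete, here is a minimal Python sketch. It assumes the commonly described rule that the high watermark is the smallest log end offset among the in-sync replicas. The function name, broker IDs, and offsets are all hypothetical, and this is a toy model rather than anything resembling broker code.

```python
def high_watermark(log_end_offsets: dict[int, int], isr: set[int]) -> int:
    """Return the offset up to which every in-sync replica has the data."""
    if not isr:
        raise ValueError("the ISR must contain at least the leader")
    # Only in-sync replicas count; a lagging replica outside the ISR
    # does not hold the high watermark back.
    return min(log_end_offsets[broker] for broker in isr)


# Hypothetical partition: broker 1 leads, broker 3 has fallen out of the ISR.
offsets = {1: 120, 2: 115, 3: 90}
print(high_watermark(offsets, isr={1, 2}))  # 115
```

Consumers only see records below this offset, which is why a slow follower that is still in the ISR can delay visibility of new records.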
We've all heard about the ZooKeeper removal that was first announced with KIP-500. In this module, we'll get a close-up look at ZooKeeper's replacement, KRaft. We'll see some of the advantages of KRaft, such as improved scalability and more efficient metadata propagation. Then we'll go through some step-by-step examples of KRaft metadata replication and reconciliation as well as how the active controller is elected from the available voters.
Consumer groups are the almost magical component that allows us to scale Kafka consumer applications up or down with ease and safety. The technology behind that wizard's curtain is the consumer group protocol. In this module, Jun gives us a thorough explanation of how the consumer group protocol works. He'll cover the group coordinator, group membership, partition assignment strategies, and how they affect the rebalancing process.
We'll also learn about group coordinator failover, group initialization, and partition offset tracking. We'll even see detailed examples of the different partition assignment strategies in action, including the amazing CooperativeStickyAssignor. When all is said and done, you might still be wondering if it's really magic after all.
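For a feel of what "sticky" means in an assignment strategy, here is a toy Python sketch. To be clear, this is not the CooperativeStickyAssignor's actual algorithm: it only illustrates the general idea that surviving consumers keep the partitions they already own, and only orphaned partitions move. The function name and consumer IDs are hypothetical, and the sketch ignores the case where a new member joins and needs partitions taken from others.

```python
def sticky_reassign(current: dict[str, list[int]], members: list[str]) -> dict[str, list[int]]:
    """Keep surviving members' partitions; give orphaned ones to the least-loaded member."""
    if not members:
        return {}
    # Surviving members start from exactly what they owned before.
    new = {m: list(current.get(m, [])) for m in members}
    # Partitions whose owner has left the group need a new home.
    orphaned = sorted(p for m, parts in current.items() if m not in new for p in parts)
    for p in orphaned:
        # Pick the member with the fewest partitions (ties broken by name).
        target = min(members, key=lambda m: (len(new[m]), m))
        new[target].append(p)
    return new


before = {"c1": [0, 1], "c2": [2, 3], "c3": [4, 5]}
print(sticky_reassign(before, ["c1", "c2"]))  # {'c1': [0, 1, 4], 'c2': [2, 3, 5]}
```

Notice that c1 and c2 never give up a partition they were already processing, which is the property that keeps rebalances cheap.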
Apache Kafka is known for its strong durability, availability, and ordering guarantees, but how does it achieve those, and what do we have to do to take advantage of them? In this module, Jun will tell us about some of the key configurations, such as acks and min.insync.replicas, and how they affect durability and availability. We'll also learn about some of the trade-offs inherent in these configurations.
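The interaction between acks and min.insync.replicas can be sketched in a few lines of Python. This is a simplified model, assuming the usual behavior that min.insync.replicas is only enforced when the producer uses acks=all. The function name and the sample numbers are hypothetical.

```python
def produce_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Toy model: would the partition leader accept this write?"""
    if isr_size < 1:
        return False  # no leader in the ISR, so nothing can accept the write
    if acks == "all":
        # With acks=all the write is rejected when too few replicas are in sync.
        return isr_size >= min_insync_replicas
    # With acks=0 or acks=1 the ISR size is not checked.
    return True


# Replication factor 3, min.insync.replicas=2, one broker down, then two.
print(produce_accepted("all", isr_size=2, min_insync_replicas=2))  # True
print(produce_accepted("all", isr_size=1, min_insync_replicas=2))  # False
print(produce_accepted("1", isr_size=1, min_insync_replicas=2))    # True
```

The second call shows the trade-off: the stricter setting protects durability, but at the cost of rejecting writes when too many replicas are unavailable.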
We'll wrap up this segment with a discussion of how the Idempotent Producer, along with message keys, can provide ordering guarantees strong enough to bank on, literally.
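Here is a rough Python sketch of the idea behind idempotent producing: the broker remembers the last sequence number it appended for each producer and uses it to spot retries and gaps. The class, method, and return values are hypothetical, and the model is deliberately simplified (one partition, one record per request), so treat it as an illustration rather than the real protocol.

```python
class PartitionLog:
    """Toy single-partition log that de-duplicates by producer sequence number."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self.last_seq: dict[int, int] = {}  # producer id -> last appended sequence

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"     # a retry of something already written: do not append again
        if seq != last + 1:
            return "out_of_order"  # a gap: reject rather than silently reorder
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


log = PartitionLog()
print(log.append(7, 0, "debit"))   # appended
print(log.append(7, 0, "debit"))   # duplicate (a network retry)
print(log.append(7, 2, "refund"))  # out_of_order (sequence 1 is missing)
print(log.append(7, 1, "credit"))  # appended
print(log.records)                 # ['debit', 'credit']
```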
In event streaming systems where multiple events need to be processed to complete a single unit of work, transactions are essential. And Kafka's transactional producer and Kafka Streams' exactly-once semantics (EOS) are on the job. In this module, we'll see how failures in a non-transactional stream processing system can lead to corrupted data or worse. Then we'll see how transactions protect us from such scenarios.
Since a Kafka topic is an immutable append-only log of events, we can't just roll back, like we might in a relational database. So, we'll see how Kafka uses a strategy of adding abort or commit markers to the log, along with setting the consumer's isolation.level=read_committed to give us the protection we need.
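The marker strategy can be modeled with a small Python sketch. It assumes a simplified log in which each entry is either a data record or a commit/abort marker, and it mimics a read_committed consumer by hiding aborted records and stopping at the first record of any still-open transaction. The Entry type, field names, and transaction IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    kind: str                    # "data", "commit", or "abort"
    txn: Optional[str] = None    # transaction id; None for non-transactional data
    value: Optional[str] = None


def read_committed(log: list[Entry]) -> list[str]:
    """Return the values a read_committed consumer would see in this toy model."""
    pending: dict[str, list[int]] = {}  # open transactions -> offsets of their records
    aborted: set[int] = set()           # offsets that belong to aborted transactions
    for offset, entry in enumerate(log):
        if entry.kind == "data" and entry.txn is not None:
            pending.setdefault(entry.txn, []).append(offset)
        elif entry.kind == "abort":
            aborted.update(pending.pop(entry.txn, []))
        elif entry.kind == "commit":
            pending.pop(entry.txn, None)
    # Nothing at or beyond the first record of an open transaction is visible yet.
    stable_end = min((offsets[0] for offsets in pending.values()), default=len(log))
    return [
        entry.value
        for offset, entry in enumerate(log[:stable_end])
        if entry.kind == "data" and offset not in aborted
    ]


log = [
    Entry("data", "A", "a1"),
    Entry("data", "B", "b1"),
    Entry("data", None, "n1"),
    Entry("abort", "B"),
    Entry("commit", "A"),
    Entry("data", "C", "c1"),   # transaction C is still open
    Entry("data", None, "n2"),
]
print(read_committed(log))  # ['a1', 'n1']
```

Nothing is ever removed from the log here. The aborted record b1 is still stored; the consumer simply skips it.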
Topic compaction is an alternative to Kafka's default time-based retention. With a compacted topic, the goal is to have the latest value for every key. In this module, we'll learn about the use cases and limitations of compacted topics, and then Jun will take us on a deep dive into the process of topic compaction. We'll see how the compaction process works at the segment level and when compaction is triggered. We'll also learn about the special way tombstones and transaction (abort/commit) markers are handled to give our applications time to process these important artifacts.
A compacted topic is guaranteed to have the latest value for a key, but there are times when it will also have older values. After going through this module, you'll understand why.
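One reason older values can linger is that the active segment is not compacted. The toy Python sketch below illustrates just that point, assuming a log split into closed segments and one active segment. The function name and sample records are hypothetical, and tombstones and transaction markers are not modeled.

```python
Record = tuple[str, int]  # (key, value)


def compact(closed_segments: list[Record], active_segment: list[Record]) -> list[Record]:
    """Keep only the newest record per key in the closed part; leave the active part alone."""
    latest_offset: dict[str, int] = {}
    for offset, (key, _value) in enumerate(closed_segments):
        latest_offset[key] = offset  # later offsets overwrite earlier ones
    cleaned = [
        record
        for offset, record in enumerate(closed_segments)
        if latest_offset[record[0]] == offset
    ]
    # The active segment is appended untouched, duplicates and all.
    return cleaned + list(active_segment)


closed = [("a", 1), ("b", 2), ("a", 3)]
active = [("a", 4)]
print(compact(closed, active))  # [('b', 2), ('a', 3), ('a', 4)]
```

The latest value for key a (4) is present, as promised, but so is the older value 3, because it was the newest one in the closed segments when the cleaner ran.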
While it has always been possible to keep data in Kafka long term, tiered storage makes it much more affordable. By moving older data to an object store, such as Amazon S3, tiered storage dramatically reduces the amount of expensive local storage we need on our brokers. Join Jun in this discussion of the benefits of tiered storage and how it works.
While tiered storage will be coming to Apache Kafka via KIP-405, in this module we'll see how we can put it to use today, with Confluent.
Kafka clusters can scale up or down as needed by adding and removing broker nodes, but along with this ability comes the need to rebalance data across brokers. In this module, we'll look at some different tools that we can use to keep our clusters in a balanced state. From the kafka-reassign-partitions.sh shell script that comes with Apache Kafka, to the full-featured Self-Balancing Clusters from Confluent, we'll look at how each of these tools works as well as some pros and cons to consider.
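The kafka-reassign-partitions.sh script works from a JSON plan passed with its --reassignment-json-file option. As a small illustration, the Python sketch below builds such a plan. The helper name, the topic name orders, and the broker IDs are hypothetical, so adjust them to your own cluster before trying anything like this.

```python
import json


def reassignment_plan(topic: str, replicas_by_partition: dict[int, list[int]]) -> dict:
    """Build a reassignment plan in the JSON shape the script reads."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": replicas}
            for partition, replicas in sorted(replicas_by_partition.items())
        ],
    }


# Hypothetical move: spread two partitions of "orders" onto a newly added broker 4.
plan = reassignment_plan("orders", {0: [1, 2, 4], 1: [2, 3, 4]})
print(json.dumps(plan, indent=2))
```

Writing the plan by hand like this is part of what makes the script laborious on large clusters, since we have to decide every replica placement ourselves.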
In order to achieve our high-availability and disaster recovery goals, we'll most often need to operate in more than one data center or cloud region. This brings with it the need for geo-replication. Fortunately, as with cluster data balancing, we are presented with multiple solutions to this challenge. In this module, Jun will walk us through several tools for operating Kafka clusters in multiple locations.
We'll learn how each of these works and what their strengths and weaknesses are. You'll still need to do some work to determine the best choice for your situation, but isn't it great to know we have these tools available?
If you're ready to dive even deeper into Kafka's internals, check out the resources below: