We recently published a series of tutorial videos and tweets on the Apache Kafka® platform as we see it. After you hear that there’s a thing called Kafka but before you put hands to keyboard and start writing code, you need to form a mental model of what the thing is. These videos give you the basics you need to know to have the broad grasp on Kafka necessary to continue learning and eventually start coding. This post summarizes them.
Before we get into it, I should mention that probably the best way to get started with Kafka is to use a fully managed cloud service like, say, Confluent Cloud. When you first sign up, you get $200 of free usage each month for your first three months. If you use the promo code CL60BLOG, you get an extra $60 of free usage on top of that.*
Now then. Let’s dive in.
Pretty much all of the programs you’ve ever written respond to events of some kind: the mouse moving, input becoming available, web forms being submitted, bits of JSON being posted to your endpoint, the sensor on the pear tree detecting that a partridge has landed on it. Kafka encourages you to see the world as sequences of events, which it models as key-value pairs. The key and the value have some kind of structure, usually represented in your language’s type system, but fundamentally they can be anything. Events are immutable, as it is (sometimes tragically) impossible to change the past.
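As a minimal sketch of that model, here is an event as an immutable key-value pair. The field names and types are purely illustrative (Kafka itself just sees bytes), and the immutability is the point: once written, an event cannot be changed.

```python
from dataclasses import dataclass

# A sketch of a Kafka-style event: an immutable key-value pair.
# The field names and types here are illustrative, not any Kafka API.
@dataclass(frozen=True)
class Event:
    key: str      # e.g., a sensor or user ID
    value: dict   # the payload; on the wire this is just bytes

landing = Event(key="pear-tree-1", value={"bird": "partridge", "count": 1})

# Immutability: attempting to change the past raises an error.
try:
    landing.value = {}
except Exception as e:
    print(type(e).__name__)  # FrozenInstanceError
```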
Because the world is filled with so many events, Kafka gives us a means to organize them and keep them in order: topics. A topic is an ordered log of events. When an external system writes an event to Kafka, it is appended to the end of a topic. By default, messages aren’t deleted from topics until a configurable amount of time has elapsed, even if they’ve been read. Topics are properly logs, not queues; they are durable, replicated, fault-tolerant records of the events stored in them. Logs are a handy data structure because they are efficient to store and maintain, but it’s worth noting that reading them is not too exciting. You can really only scan a log, not query it, so we’ll have to add functionality on a future day to make this more pleasant.
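The append-and-scan behavior can be sketched in a few lines. Real topics are durable, replicated files on disk; this in-memory toy only illustrates the access pattern: writes go to the end, reads are a sequential scan from an offset.

```python
# A toy append-only log, sketching how a Kafka topic stores events.
class TopicLog:
    def __init__(self):
        self._events = []

    def append(self, key, value):
        """Writes go to the end of the log and receive the next offset."""
        self._events.append((key, value))
        return len(self._events) - 1  # the new event's offset

    def scan(self, from_offset=0):
        """Reading is a sequential scan from an offset, not a query."""
        return self._events[from_offset:]

log = TopicLog()
log.append("user-1", "page_view")
log.append("user-2", "checkout")
print(log.scan(1))  # [('user-2', 'checkout')]
```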
Topics are stored as log files on disk, and disks are notoriously finite in size. It would be no good if our ability to store events were limited to the disks on a single server, or if our ability to publish new events to a topic or subscribe to updates on that topic were limited to the I/O capabilities of a single server. To be able to scale out and not just up, Kafka gives us the option of breaking topics into partitions. Partitions are a systematic way of breaking the one topic log file into many logs, each of which can be hosted on a separate server. This gives us the ability in principle to scale topics out forever, although practical second-order effects and the finite amount of matter and energy available in the known universe to perform computation do place some upper bounds on scalability.
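Key-based partition assignment is easy to sketch. Kafka’s default partitioner hashes the key bytes with murmur2; the CRC-32 stand-in below is purely for illustration. The property that matters is that the same key always lands on the same partition, which preserves per-key ordering.

```python
import zlib

# A sketch of key-based partition assignment. Kafka's default partitioner
# uses a murmur2 hash of the key bytes; CRC-32 stands in here purely for
# illustration.
def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2  # same key, same partition, every time
```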
Kafka is distributed data infrastructure, which implies that there is some kind of node that can be duplicated across a network such that the collection of all of those nodes functions together as a single Kafka cluster. That node is called a broker. A broker can run on bare metal hardware, a cloud instance, in a container managed by Kubernetes, in Docker on your laptop, or wherever JVM processes can run. Kafka brokers are intentionally kept very simple, maintaining as little state as possible. They are responsible for writing new events to partitions, serving reads on existing partitions, and replicating partitions among themselves. They don’t do any computation over messages or routing of messages between topics.
As a responsible data infrastructure component, Kafka provides replicated storage of topic partitions. Each topic has a configurable replication factor that determines how many of these copies will exist in the cluster in total. One of the replicas is elected to be the leader, and it is to this replica that all writes are produced and from which, in most configurations, all reads are consumed. (There are some advanced features that allow some reads to be served by follower replicas, but let’s not worry about those here on day five.) The other replicas are called followers, and it is their job to stay up to date with the leader and be eligible for election as the new leader if the broker hosting the current leader goes down.
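The failover rule can be sketched as follows. Kafka’s actual controller logic is far more involved, but the essential constraint is shown here: a new leader must come from the replicas that were fully in sync with the old one.

```python
# A sketch of leader failover: pick a new leader from the in-sync
# followers. Broker names and the in-sync set are invented for the example.
def elect_leader(replicas, in_sync, failed_leader):
    """Return the first surviving in-sync replica, or None if there isn't one."""
    for broker in replicas:
        if broker != failed_leader and broker in in_sync:
            return broker
    return None  # no eligible replica; the partition is unavailable

replicas = ["broker-1", "broker-2", "broker-3"]   # replication factor 3
in_sync = {"broker-1", "broker-3"}                # broker-2 fell behind
print(elect_leader(replicas, in_sync, failed_leader="broker-1"))  # broker-3
```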
Once the Kafka cluster is up and running with its minimal feature set, we need to be able to talk to it from the outside world. A producer is an external application that writes messages to a Kafka cluster, communicating with the cluster using Kafka’s network protocol. That network protocol is a publicly documented thing, but it would be an extraordinarily bad idea to write your own interface library when so many excellent ones are available.
Out of the box, Apache Kafka provides a Java library, and Confluent supports libraries in Python, C/C++, .NET languages, and Go. The producer library manages all of the non-trivial network plumbing between your client program and the cluster and also makes decisions like how to assign new messages to topic partitions. The producer library is surprisingly complex in its internals, but the API surface area for the basic task of writing a message to a topic is very simple indeed.
The consumer is an external application that reads messages from Kafka topics and does some work with them, like filtering, aggregating, or enriching them with other information sources. Like the producer, it relies on a client library to handle the low-level network interface in addition to some other pretty sophisticated functionality. A consumer can be just a single instance, or it can be many instances of the same program: a consumer group.
Consumer groups are elastically scalable by default, but the library only manages some of the challenges associated with scale-out and fault tolerance. For example, if your consumer is stateful (and it probably is), then you’ll have some extra work to do to manage that state during failover or scaling operations.
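The elasticity comes from dividing a topic’s partitions among the members of the group. Kafka ships several assignment strategies (range, round-robin, sticky); the round-robin version sketched below is the simplest, and the consumer names are made up.

```python
# A sketch of round-robin partition assignment across a consumer group.
def assign_partitions(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))          # a topic with 6 partitions
group = ["consumer-a", "consumer-b"]
print(assign_partitions(partitions, group))
# {'consumer-a': [0, 2, 4], 'consumer-b': [1, 3, 5]}
```

When a consumer joins or leaves, the group recomputes an assignment like this one; any state your consumer built from its old partitions is what you, not the library, must account for.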
Let’s pause on this day and set the stage for the rest of the series. With basic pub/sub, partitioning, producing, and consuming in hand, other needs are going to arise. These needs consistently emerge in organizations making serious use of Kafka: data integration, schema management, and options for stream processing. The Kafka and Confluent communities have solved these problems in standard ways and are likely to continue solving new common problems as they arise.
Kafka Connect is a system for connecting non-Kafka systems to Kafka in a declarative way, without requiring you to write a bunch of undifferentiated integration code to connect to the exact same systems that the rest of the world is connecting to.
Connect runs as a scalable, fault-tolerant cluster of machines external to the Kafka cluster. Rather than write bespoke code to read data from a database or write messages to Elasticsearch, you deploy pre-built connectors from the extensive connector ecosystem, and configure them with a little bit of JSON. Connect then reads data from source systems and writes it to sink systems automatically.
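That “little bit of JSON” looks something like the fragment below. The connector class is the Confluent JDBC source connector; the connector name, database URL, and column name are invented for illustration, and a real deployment would add credentials and tuning options.

```json
{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "pg-"
  }
}
```

Posting this configuration to the Connect REST interface is all it takes to start streaming rows from that table into a Kafka topic; no bespoke code is involved.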
Schema change is a constant fact of life. Any time you serialize data, put it somewhere, and hope to go get it from that place later on, changes in the format of the data are a perennial challenge. We feel this problem most acutely in database schemas, but message formats in Kafka are no exception. The Confluent Schema Registry exists to help manage schema change over time. When you release a new producer or a new consumer application with a modified message format, the Schema Registry will help the client application determine whether the new schema is compatible with the old one, given the expectations of other clients that have not yet been upgraded. It’s an indispensable tool for a complex deployment.
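One such compatibility rule can be sketched simply: a new schema is backward compatible if every field it adds carries a default, so consumers on the new schema can still read old messages. Real registries (and serialization formats like Avro) implement far richer resolution rules, and the schemas below are invented for the example.

```python
# A sketch of a backward-compatibility check: added fields must have defaults.
def is_backward_compatible(old_fields, new_fields):
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f].get("default") is not None for f in added)

v1 = {"id": {"type": "long"}, "amount": {"type": "double"}}
v2 = {**v1, "currency": {"type": "string", "default": "USD"}}
v3 = {**v1, "currency": {"type": "string"}}  # no default: breaks old data

print(is_backward_compatible(v1, v2))  # True
print(is_backward_compatible(v1, v3))  # False
```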
Producing messages to Kafka is often fairly simple: Messages come from some source, either read from some input or computed from some prior state, and they go into a topic. But reading gets complicated very quickly, and the consumer API really doesn’t offer much more abstraction than the producer.
The Kafka Streams API exists to provide this layer of abstraction on top of the vanilla consumer. It’s a Java API that provides a functional view of the typical stream processing primitives that emerge in complex consumers: filtering, grouping, aggregating, joining, and more. It provides an abstraction not just for streams, but for streams turned into tables, and a mechanism for querying those tables as well. It builds on the consumer library’s native horizontal scalability and fault tolerance, while addressing the consumer’s limited support for state management.
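The stream/table duality at the heart of that API can be illustrated without Kafka at all: a stream of events, grouped by key and aggregated, yields a continuously updated table. The Streams API does this in Java over live topics with full fault tolerance; this sketch just replays a fixed list of invented click events to show the shape of the computation.

```python
from collections import defaultdict

# A sketch of stream/table duality: aggregating a stream yields a table.
clicks = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]

counts = defaultdict(int)           # the "table": latest count per key
for user, _url in clicks:           # the "stream": one update per event
    counts[user] += 1

print(dict(counts))  # {'alice': 2, 'bob': 1}
```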
Writing stream processing applications in Java is a nice thing to do if you’re using Kafka Streams, and if you’re using Java, and if it makes sense to marry stream processing functionality with the application itself. But what if you didn’t want to do those things? Or what if you wanted a simpler approach in which you just used SQL to get your stream processing done?
This is precisely what ksqlDB is: an application-oriented stream processing database for Kafka. A small cluster of ksqlDB nodes runs continuous stream processing queries written in SQL, constantly consuming input events and producing results back into Kafka. It exposes the same stream and table abstractions as Kafka Streams and makes tables queryable through a lightweight JSON API.
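The same count-per-key aggregation looks something like this in ksqlDB’s SQL dialect. The stream name, columns, and topic are invented for illustration.

```sql
-- Declare a stream over an existing Kafka topic.
CREATE STREAM clicks (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC = 'clicks', VALUE_FORMAT = 'JSON');

-- A continuous query: the resulting table updates as new events arrive.
CREATE TABLE clicks_per_user AS
  SELECT user_id, COUNT(*) AS click_count
  FROM clicks
  GROUP BY user_id
  EMIT CHANGES;
```

The table it creates is backed by a Kafka topic like everything else, and can also be queried on demand through ksqlDB’s lightweight JSON API.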
I hope this video series has helped you see the breadth of the Kafka ecosystem. I want you to have a basic mental model of how Kafka works and what other components have grown up around it to help you solve the kinds of problems that will inevitably present themselves as you build event-driven systems. If you’ve got some time in the next couple of weeks, and if a bit of study helps you relax as it does for me, be sure to check them out.
Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer advocacy. He can frequently be found speaking at conferences in the U.S. and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to distributed systems, and is the author of Gradle Beyond the Basics. He lives in Littleton, CO, U.S., with the wife of his youth, their three children having grown up.