Apache Kafka® is an enormously successful piece of data infrastructure, functioning as the ubiquitous distributed log underlying the modern enterprise. It is scalable, available as a managed service, and has simple APIs available in pretty much any language you want. But as much as Kafka does a good job as the central nervous system of your company’s data, there are so many systems that are not Kafka that you still have to talk to. Writing bespoke data integration code for each one of those systems would have you writing the same boilerplate and unextracted framework code over and over again. Which is another way of saying that if the Kafka ecosystem didn’t already have Kafka Connect, we would have to invent it.
Kafka Connect is the pluggable, declarative data integration framework for Kafka. It connects data sinks and sources to Kafka, letting the rest of the ecosystem do what it does so well with topics full of events. As is the case with any piece of infrastructure, there are a few essentials you’ll want to know before you sit down to use it, namely setup and configuration, deployment, error handling, troubleshooting, its API, and monitoring. Confluent Developer recently released a Kafka Connect course covering these topics, and in each of the sections below, I’d like to share something about the content of each lesson in the course.
The declarative nature of Kafka Connect makes data integration within the Kafka ecosystem accessible to everyone, even if you don’t typically write code. Kafka Connect’s functionality replaces code that you’d otherwise have to write, deploy, and care for. So instead of your application facing multiple write problems and transaction consistency issues, it can focus on its primary function while Kafka Connect streams its data and writes it independently to another cloud service or data store—or even several—using Confluent’s fully managed connectors for S3, Salesforce, Snowflake, and MongoDB Atlas, for example.
In this video, you’ll see Kafka with self-managed Kafka Connect in action: data flows from a MySQL source via Debezium CDC into Kafka, then out to Elasticsearch and Neo4j, where it is visualized in near real time.
A Kafka Connect worker’s primary function is to help you get data out of your sources and into your sinks. There are a few internal components that help get this done. Connectors perform the actual interfacing of Kafka with your external data sources, implementing whatever external protocol those data sources and sinks speak and interfacing to Kafka on the other end. Transforms execute stateless functions on events to get them into the right format for the destination system (adding metadata, dropping columns, etc.). Finally, converters serialize or deserialize the data on its way into or out of Kafka. These three components are all extensible, although you will rarely have to write your own code, given that there are hundreds to be found on Confluent Hub, all easily accessed with a task-specific CLI.
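To make the three components concrete, here is a minimal sketch of a sink connector configuration written as a Python dictionary. The topic name, Elasticsearch URL, transform alias (`addSource`), and the inserted field name and value are hypothetical. The class names are the ones I would expect for the Confluent Elasticsearch sink connector and for the converters and `InsertField` transform that ship with Apache Kafka, but check them against the versions you have installed.

```python
import json


def build_sink_config(topic: str, es_url: str) -> dict:
    """Return an illustrative sink connector configuration (all values are strings)."""
    return {
        # Connector: speaks the external system's protocol (here, Elasticsearch).
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": topic,
        "connection.url": es_url,
        # Converters: deserialize record keys and values coming out of Kafka.
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
        # Transform: a stateless function applied to each record; this one
        # adds a static metadata field to the record value.
        "transforms": "addSource",
        "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.addSource.static.field": "source_system",  # hypothetical field name
        "transforms.addSource.static.value": "orders-db",      # hypothetical value
    }


# Hypothetical topic and URL, for illustration only.
config = build_sink_config("orders", "http://localhost:9200")
print(json.dumps(config, indent=2))
```

Each group of keys maps to one of the components described above, which is why a complete pipeline can usually be declared without writing any integration code.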
The easiest way to use Kafka Connect is through Confluent Cloud’s managed connectors. With check-the-box setup for your data sources and sinks, it’s quick to get started building your data pipelines. But perhaps you can’t use the cloud due to regulatory issues or infrastructure requirements, or the connector that you want to use is only available locally. To deploy your own Kafka Connect instance, you’ll need to choose between standalone and distributed mode. The latter is recommended for nearly all installations—even simple prototypes—since it’s no harder to use, and you will likely need its fault tolerance and horizontal scaling when you reach production. The tutorials in this series also use distributed-mode Kafka Connect in Docker containers.
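As a rough illustration of what distributed mode asks of you, the sketch below renders a minimal worker properties file from Python. The property keys are standard distributed-worker settings. The broker address, group ID, and the naming scheme for the three internal topics are assumptions chosen for the example. Workers that share the same `group.id` form one Connect cluster and store connector configs, offsets, and status in those topics.

```python
def distributed_worker_properties(bootstrap_servers: str, group_id: str) -> str:
    """Render a minimal distributed-mode worker configuration as properties text."""
    props = {
        "bootstrap.servers": bootstrap_servers,
        # Workers with the same group.id join the same Connect cluster.
        "group.id": group_id,
        # Internal topics holding cluster state; these names are a hypothetical convention.
        "config.storage.topic": f"{group_id}-configs",
        "offset.storage.topic": f"{group_id}-offsets",
        "status.storage.topic": f"{group_id}-status",
        # Default converters; individual connectors can override them.
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Hypothetical broker address and group ID.
print(distributed_worker_properties("broker:9092", "connect-cluster"))
```

Starting a second worker with the same file is what gives you the fault tolerance and horizontal scaling mentioned above.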
Containerized Kafka Connect is a streamlined way to get started, and Confluent maintains a Docker base image you can use. You’ll need to add dependencies for your connectors and other components, which you can fetch from a vetted list at Confluent Hub and either bake into a new image or, if you insist, have installed at runtime.
Data integration is inherently error-prone, relying as it does on connections between things, and that would be true even if Connect itself never produced a single error. Stack traces and logs will help you with most problems, but you need to know some basics before you wade in. For example, take a running connector with a failed task: if you start your troubleshooting by inspecting the task’s stack trace and then proceed to your Kafka Connect log, you will quickly get an idea of what is causing your problem. And the recently completed Kafka Improvement Proposals (KIPs) addressing dynamic log configuration in Connect as well as additional context for Connect log messages make the job quite a bit more pleasant.
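The "running connector with a failed task" case can be sketched as follows, assuming the JSON shape returned by the Connect REST status endpoint (`GET /connectors/<name>/status`), where each task carries a `state` and, when it has failed, a `trace`. The sample payload, connector name, and worker ID are made up for illustration, and the trace text is a placeholder rather than real output.

```python
def failed_task_traces(status: dict) -> dict:
    """Map the ID of each FAILED task to its stack trace text."""
    return {
        task["id"]: task.get("trace", "")
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    }


# Hypothetical status payload: the connector is RUNNING but one task has failed.
sample_status = {
    "name": "sink-orders",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "connect:8083"},
        {
            "id": 1,
            "state": "FAILED",
            "worker_id": "connect:8083",
            "trace": "<stack trace placeholder>",
        },
    ],
}

for task_id, trace in failed_task_traces(sample_status).items():
    print(f"task {task_id} failed:\n{trace}")
```

Note that the connector-level state alone would not reveal the problem here, which is why the task list is the first place to look before turning to the worker log.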
Knowing your options for error handling in Kafka Connect can help you confront the data serialization challenges that are likely to arise as you integrate data from numerous sources. Your choices stretch beyond the shutdown-on-error default. For example, a popular strategy used with sink connectors collects incorrectly serialized messages into a dead letter queue, then reroutes them to eventual success by sending them through a different converter. In other situations, a manual option for processing erroring messages may be needed if the reasons for failure are difficult to identify ahead of time.
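A dead letter queue is enabled through a handful of `errors.*` settings on a sink connector. The sketch below layers them onto an existing configuration. The helper name `with_dead_letter_queue`, the base configuration, and the topic name are hypothetical, and the setting names should be verified against the Kafka Connect version you run.

```python
def with_dead_letter_queue(sink_config: dict, dlq_topic: str) -> dict:
    """Return a copy of a sink connector config that routes bad records to a DLQ."""
    updated = dict(sink_config)  # leave the caller's dictionary untouched
    updated.update(
        {
            # Keep the task running instead of failing on the first bad record.
            "errors.tolerance": "all",
            # Topic that receives records that could not be converted or transformed.
            "errors.deadletterqueue.topic.name": dlq_topic,
            # Attach failure details to each DLQ record as headers.
            "errors.deadletterqueue.context.headers.enable": "true",
            # Also write failures, including the message contents, to the Connect log.
            "errors.log.enable": "true",
            "errors.log.include.messages": "true",
        }
    )
    return updated


# Hypothetical base configuration and DLQ topic name.
base = {"topics": "orders", "value.converter": "io.confluent.connect.avro.AvroConverter"}
print(with_dead_letter_queue(base, "dlq-orders"))
```

A second sink connector reading from the dead letter queue topic with a different converter is one way to implement the rerouting strategy described above.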
Best practices like creating connectors with PUT and fetching a connector’s associated topic names programmatically can improve the administration of your Kafka Connect instance via REST. Then there are little tricks like employing jq for JSON printing and peco for interactive command output filtering that will make your REST interactions a little more sprightly. The Confluent documentation for the Connect REST API (see also the Confluent Cloud documentation when using fully managed connectors) is the best source for detailed command specifics, but an overview of the basics is a good way to begin your journey with Kafka Connect’s powerful API.
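Here is a small sketch of the two practices just mentioned, using only Python's standard library. It builds the HTTP requests without sending them, so it runs without a Connect worker. The base URL, connector name, and configuration are hypothetical. `PUT /connectors/<name>/config` creates the connector if it does not exist and updates it otherwise, so the same call can be repeated safely, and `GET /connectors/<name>/topics` returns the topics a connector is actually using (this endpoint requires a reasonably recent Connect version).

```python
import json
import urllib.request


def put_connector_request(base_url: str, name: str, config: dict) -> urllib.request.Request:
    """Build an idempotent create-or-update request for a connector."""
    return urllib.request.Request(
        url=f"{base_url}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),  # body is the config map itself
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def connector_topics_request(base_url: str, name: str) -> urllib.request.Request:
    """Build a request that asks which topics a connector is using."""
    return urllib.request.Request(url=f"{base_url}/connectors/{name}/topics", method="GET")


# Hypothetical worker address, connector name, and configuration.
req = put_connector_request(
    "http://localhost:8083",
    "sink-orders",
    {"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector"},
)
topics_req = connector_topics_request("http://localhost:8083", "sink-orders")
print(req.get_method(), req.full_url)
print(topics_req.get_method(), topics_req.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen` against a running worker and decode the JSON response.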
If fully managed Kafka Connect in Confluent Cloud is a possibility for you, it’s the most straightforward way to accomplish your integration, as you have to manage zero infrastructure yourself. If you do prefer to self-manage Kafka Connect, then permit me to recommend using Confluent Platform. In both cases, you have extensive graphical monitoring tools that let you quickly gain an overview of your data pipelines. From the overview, you might just drill down into a list of consumers for each pipeline, and then one step deeper into their specifics—seeing real-time statistics like topic consumption information and partition offsets. If you aren’t using Confluent Cloud or Confluent Platform, your options for monitoring Kafka Connect lie in JMX metrics and REST, because your Kafka Connect instance itself exposes extensive information.
Now that you have been introduced to Kafka Connect’s internals and features, and a few strategies that you can use with it, the next step is to experiment with establishing and running an actual Kafka Connect deployment. Check out the free Kafka Connect 101 course on Confluent Developer for code-along tutorials addressing each of the topics above, along with in-depth textual guides and links to external resources.
You can also listen to the podcast Intro to Kafka Connect: Core Components and Architecture, or sign up for the meetup From Zero to Hero with Kafka Connect to learn more about key design concepts of Kafka Connect. During the meetup, you’ll see a live demo of building pipelines with Kafka Connect for streaming data in from databases and out to targets including Elasticsearch.
Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer advocacy. He can frequently be found speaking at conferences in the U.S. and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to distributed systems, and is the author of Gradle Beyond the Basics. He lives in Littleton, CO, U.S., with the wife of his youth, their three children having grown up.