A data pipeline is a method for getting data from one system to another, whether for analytics purposes or for storage. Learning the elements that make up this proven architecture can help you form a mental blueprint that will be useful when you go to design your own data systems.
A pipeline can be as simple as a single database and an analytics dashboard, or far more complex, with numerous data sources sending to a central processing layer, where the data is joined and manipulated—then queried from there by an analytics layer. Streaming is an attribute often added to data pipelines, and it means sending data from sources to targets as events happen. Streaming brings the advantages of accurate and relevant data, and it helps to avoid the bottlenecks that can arise from huge influxes of data. (However, by definition, data pipelines don’t have to be streaming.)
A proven option for the central processing layer in a complex data pipeline is Apache Kafka®, usually augmented by ksqlDB or Kafka Streams for data manipulation. Kafka mixes well with the decoupled nature of pipelines since data is preserved in it, replayable at any time. This is critical, for example, if an existing system element needs to be restored, or a new one needs to be brought online. In turn, ksqlDB serves as the processing layer, where you can perform operations on your data like denormalizing, filtering, aggregating, running machine learning models, etc.—before sending it along to its target(s).
This blog post will describe the elements involved in setting up a Kafka-based data pipeline: connecting data entities together; streaming data from source(s) into the middle of the pipeline—Kafka; filtering, joining, and enriching the data with ksqlDB; and finally channeling data out to Kibana on Elasticsearch. After reading this post, make sure to visit the extensive Confluent Developer course covering the concept, Data Pipelines 101.
The most efficient way to wire up streaming data ingest and egress in a Kafka data pipeline is to use Kafka Connect, which you can run as a managed, easily configured tool on Confluent Cloud, or directly on your own server (which you might want to do if a particular connector isn’t available on Confluent Cloud or you’re not using the Cloud). Kafka Connect lets you join myriad systems with Kafka: relational databases such as Oracle, Postgres, and MySQL; message queues like ActiveMQ, IBM MQ, and JMS providers; cloud data warehouses like Snowflake and BigQuery; NoSQL databases like MongoDB and Cassandra, or SaaS platforms like Salesforce. If you are not using the connectors on Confluent Cloud, you can find connectors on the Confluent Hub.
Change data capture lets you subscribe to the changes in a source database after taking a full snapshot of it, which is an effective way to get data from the beginning of your pipeline into the middle, where Kafka resides. You have two options for implementing it:
The query method simply polls the source database on a regular interval using a query predicate based on a timestamp or a monotonically incrementing identifier, or both. Query-based CDC is usually quicker to set up than log-based CDC (provided that you can modify source schemas), but it comes with several limitations: It continuously hits your database, you can’t track DELETEs, and you can only track the latest changes (keep in mind that multiple actions could happen in the time between polls, so not all actions may get captured.) An example of query-based CDC is the Kafka Connect JDBC Source Connector.
Conversely, log-based CDC reads the transaction log (aka redo log, binlog, etc.) of the source database, a binary file. The advantage of the log-based approach over the query-based one is that it captures all changes, including DELETEs, and so transmits a complete history of a database. In addition, log-based CDC captures everything with both lower latency and lower impact on your database. The downside of log-based CDC is that it takes longer to set up and generally requires greater system privileges. Connectors that use the log-based approach are the Confluent Oracle CDC connector and all of the Debezium project connectors.
Once you have established your data flow, the next step is to add value to the data traveling through your pipeline by processing it in some way—for example, by filtering or enriching it. You can use Kafka Streams for this if you are experienced with Java and good at deploying it. However, ksqlDB is an option for anyone, regardless of programming language expertise, as it enables stream processing programs to be expressed in simple SQL. Using ksqlDB, you could quickly apply a filter, for instance, to remove all events that contain a specific value in a field—such as the word “test.”
For most data pipelines that you’ll build, you’ll need to join data sources to some degree. For example, you may want to join customer ratings from a website with customer data such as phone and email (to take the example used in Data Pipelines 101).
Your customer data would likely reside in an external database, and would therefore first need to be pulled into Kafka via Kafka Connect so that a join can be performed. Additionally, to use this data as a lookup source, you would need to convert it into a stateful table from the stream form that it arrives in from the external database. (In ksqlDB, streams can easily be converted to tables and vice versa.) Conversely, the customer ratings would likely come from a REST endpoint, a mobile app, or a web frontend, etc. Once your data is in the correct form, you are able to enrich each customer rating as it arrives with customer data from the table by looking up its key, and you can send these to a new stream.
Once you’ve successfully placed data into your pipeline and have processed it, a final step is to create a sink to receive your data for analytics purposes, or for storage. As mentioned above, you can use Kafka Connect just as easily for data egress as you can for data ingest. On Confluent Cloud, you’ll find connectors for most of the places you’re likely to want to send your data to, including object stores, cloud data warehouses, and NoSQL stores. For analytics, the fully managed Elasticsearch connector on Confluent Cloud is a sturdy option, and as with all of the other managed connectors, setup only requires that you to fill in a few blanks on the Confluent Cloud Console.
Now that you’ve learned the basic elements of a data pipeline, you can practice setting up a complete pipeline by following the cumulative exercises in the Data Pipelines 101 course. In the example scenario, you’ll join mock customer ratings arriving from a theoretical web application to customer data pulled from a mock MySQL database. Your data will ultimately be channeled to Kibana on Elasticsearch. Over the course of the exercise, you’ll learn Kafka Connect along with ksqlDB filters, joins, and tables—and more.
Begin by setting up your environment in the Kafka ecosystem, including Kafka, Kafka Connect, Confluent Schema Registry, and ksqlDB, on Confluent Cloud. Note that if you use the promo code PIPELINES101, you’ll get $101 of free Confluent Cloud usage.
Decoupling source and target in a data pipeline based around Kafka has many advantages. As alluded to earlier, if a side goes offline, the pipeline is not impacted: If the target goes down, Kafka still holds the data and can resume sending it to the target when it is back; if the source goes down, the target will not even notice other than the fact that there is no data; finally, if the target can’t keep up with the source, Kafka can absorb some of the backpressure. Data pipelines also evolve well because the data is held in Kafka and can be sent to any number of targets independently.
To learn even more about data pipelines, make sure to: