Show Me How: Build Streaming Data Pipelines for Real-Time Data Warehousing | Register Today
Apache Kafka® is often deployed alongside Elasticsearch to perform log exploration, metrics monitoring and alerting, data visualisation, and analytics. It is complementary to Elasticsearch but also overlaps in some ways, solving similar problems. ksqlDB, the event streaming database purpose-built for stream processing applications, likewise complements the Elasticsearch ecosystem while offering different approaches to handling certain scenarios. ksqlDB and Elasticsearch combined are among the best-of-breed tools to support data enrichment and transformation as well as alerting before ingestion into Elasticsearch where the search and analytics can take place.
Before diving in, perhaps you’re wondering what Elasticsearch even is. Elasticsearch is a distributed search engine/datastore built on top of Apache Lucene™ that can be used for a variety of use cases beyond text search. These use cases can leverage the innovative data structures and features in Apache Lucene, including fast Boolean queries, auto-suggest, geo-point queries, and numeric or time series queries.
Other features of Elasticsearch build on top of rich aggregation capabilities that can group the data according to different criteria (e.g., by region, SKU, and geographic region) and calculate incredibly fast aggregations such as sums, averages, and statistical summaries, or anomaly detection on the fly at query time. The platform powers different use cases, such as enterprise search, application performance monitoring (APM), threat monitoring & detection, and anomaly detection, to name a few.
ksqlDB allows you to write SQL queries on streams from various sources, create derived streams and materialized tables (using push queries), and perform database-like lookups on those materialized tables (using pull queries). Examples of sources include relational databases via change data capture (CDC) connectors, a Twitter feed stream, or sensors sending Internet of Things (IoT) data. The data might need to be filtered or transformed into other shapes for specific applications or downstream systems. To filter, transform, join, or create aggregations from streams of data, you generally have three options:
However, for simple scenarios where we would like to avoid writing Java or Scala, as well as avoid the overhead of provisioning the clusters for the aforementioned stream processing engines, we can use ksqlDB. The ability to use a declarative language like SQL is a significant advantage over lower-level stream processing systems in the data ecosystem. A five-line ksqlDB statement’s equivalent in Kafka Streams or other stream processing frameworks might well be 10 times as long. Since ksqlDB is built on Kafka Streams, it is also scalable and fault tolerant.
ksqlDB applications persist their output to Kafka topics that can then be consumed by applications downstream. These applications might be reacting to some security event like shutting down a user’s access, or they might be populating materialized views in databases or search engines (via Kafka Connect) that then power dashboards or reports.
At face value, ksqlDB contains some similarities to Elasticsearch:
AVGand also has the ability to hook in user-defined aggregation functions (UDAFs) to do the same.
SQL WHEREclauses to create derived streams. In addition, connectors defined via
CREATEconnectors can be configured with Single Message Transforms and type converters to reformat or drop data.
CSAS) is the basis for querying future data arriving into Kafka for a given criteria.
It might help to look at where Kafka and ksqlDB fit into data pipelines with various systems.
Where Elasticsearch is simply used as a large log search or metrics engine, Kafka is often deployed as a high-throughput buffer between data producers and Elasticsearch. In these use cases, Kafka helps with the data extraction process to ensure that the producers don’t overwhelm the Elasticsearch cluster, to provide scalability, and to more loosely couple the producers with Elasticsearch.
However, where Kafka is used as a central nervous system collecting event data from a wide variety of different systems, Elasticsearch may just be one of many downstream systems that stores the raw or transformed data for querying or analytics capabilities such as search or business intelligence (BI). In this case, Elasticsearch acts as yet another materialized view that can drive applications such as dashboards, data exploration, or reporting. Since different downstream applications may need the data to be shaped for their needs, the transformation (the T in ETL) or pattern detection may need to happen earlier and as soon as the data arrives in the pipeline.
On the pattern detection side, Elasticsearch’s fast indexing and querying capabilities allow you to mimic real-time or near-real-time detection or reactions to interesting events even though they are still polling based on the data that is indexed. With polling-based approaches, edge cases that miss events on window boundaries or late-arriving data for lookups are more likely. In polling-based or micro-batching systems, different problems can arise (e.g., dealing with windowing or late arrival of data).
When building an event-driven architecture, ksqlDB shines in these areas:
With these advantages, we can see that Kafka and ksqlDB can be used to build event-driven applications that transform, enrich, and react to data as it arrives. This complements Elasticsearch, where you can send the data for long-term storage and analytics.
Typically, we would deploy Kafka in front of Elasticsearch and use Kafka Connect to push data from selected Kafka topics to Elasticsearch.
Since ksqlDB can configure Kafka connector jobs within ksqlDB itself, this greatly simplifies the process of setting up Connect jobs for sources and sinks (including Elasticsearch).
In fact, you can take data that’s stored in Elasticsearch—whether it was ingested directly or derived from queries—and get it into Kafka as well. For example, your data analysis or machine learning might provide a list of bad actors that you would like to load as a blacklist into ksqlDB for joining access logs in another topic. These might be events that have been transformed or filtered in a certain way using Elasticsearch queries, or it might take the form of curated reference data which could be used to join or enrich data in ksqlDB.
We’ve covered some of the features of ksqlDB and Elasticsearch, as well as highlighted some of the similarities and subtle differences of each when it comes to building real-time data products. With ksqlDB, we are able to enrich and filter data further upstream from Elasticsearch and send the data in a denormalized fashion for easy analysis and reporting. This can be done in real time as data comes in and not after the fact via a batch process or at query time.
This step-by-step demo shows you how to set up data streams and tables using ksqlDB to send data to Elasticsearch for analysis and visualisation in Kibana. If you’d like to get started with the Kafka Connect Elasticsearch Sink Connector, you can read Danny Kay and Liz Bennett’s blog post to learn more.
I hope this has given you a flavour of what is possible with these technologies and inspires you to create more innovative applications using these patterns.
An Approach to combining Change Data Capture (CDC) messages from a relational database into transactional messages using Kafka Streams.
Change data capture (CDC) converts all the changes that occur inside your database into events and publishes them to an event stream. You can then use these events to power analytics, drive operational use cases, hydrate databases, and more. The pattern is enjoying wider adoption than ever before.