Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now
Apache Kafka® is often deployed alongside Elasticsearch to perform log exploration, metrics monitoring and alerting, data visualisation, and analytics. It is complementary to Elasticsearch but also overlaps in some ways, solving similar problems. ksqlDB, the event streaming database purpose-built for stream processing applications, likewise complements the Elasticsearch ecosystem while offering different approaches to handling certain scenarios. ksqlDB and Elasticsearch combined are among the best-of-breed tools to support data enrichment and transformation as well as alerting before ingestion into Elasticsearch where the search and analytics can take place.
Before diving in, perhaps you’re wondering what Elasticsearch even is. Elasticsearch is a distributed search engine/datastore built on top of Apache Lucene™ that can be used for a variety of use cases beyond text search. These use cases can leverage the innovative data structures and features in Apache Lucene, including fast Boolean queries, auto-suggest, geo-point queries, and numeric or time series queries.
Other features of Elasticsearch build on top of rich aggregation capabilities that can group the data according to different criteria (e.g., by region, SKU, and geographic region) and calculate incredibly fast aggregations such as sums, averages, and statistical summaries, or anomaly detection on the fly at query time. The platform powers different use cases, such as enterprise search, application performance monitoring (APM), threat monitoring & detection, and anomaly detection, to name a few.
ksqlDB allows you to write SQL queries on streams from various sources, create derived streams and materialized tables (using push queries), and perform database-like lookups on those materialized tables (using pull queries). Examples of sources include relational databases via change data capture (CDC) connectors, a Twitter feed stream, or sensors sending Internet of Things (IoT) data. The data might need to be filtered or transformed into other shapes for specific applications or downstream systems. To filter, transform, join, or create aggregations from streams of data, you generally have three options:
However, for simple scenarios where we would like to avoid writing Java or Scala, as well as avoid the overhead of provisioning the clusters for the aforementioned stream processing engines, we can use ksqlDB. The ability to use a declarative language like SQL is a significant advantage over lower-level stream processing systems in the data ecosystem. A five-line ksqlDB statement’s equivalent in Kafka Streams or other stream processing frameworks might well be 10 times as long. Since ksqlDB is built on Kafka Streams, it is also scalable and fault tolerant.
ksqlDB applications persist their output to Kafka topics that can then be consumed by applications downstream. These applications might be reacting to some security event like shutting down a user’s access, or they might be populating materialized views in databases or search engines (via Kafka Connect) that then power dashboards or reports.
At face value, ksqlDB contains some similarities to Elasticsearch:
It might help to look at where Kafka and ksqlDB fit into data pipelines with various systems.
Where Elasticsearch is simply used as a large log search or metrics engine, Kafka is often deployed as a high-throughput buffer between data producers and Elasticsearch. In these use cases, Kafka helps with the data extraction process to ensure that the producers don’t overwhelm the Elasticsearch cluster, to provide scalability, and to more loosely couple the producers with Elasticsearch.
However, where Kafka is used as a central nervous system collecting event data from a wide variety of different systems, Elasticsearch may just be one of many downstream systems that stores the raw or transformed data for querying or analytics capabilities such as search or business intelligence (BI). In this case, Elasticsearch acts as yet another materialized view that can drive applications such as dashboards, data exploration, or reporting. Since different downstream applications may need the data to be shaped for their needs, the transformation (the T in ETL) or pattern detection may need to happen earlier and as soon as the data arrives in the pipeline.
On the pattern detection side, Elasticsearch’s fast indexing and querying capabilities allow you to mimic real-time or near-real-time detection or reactions to interesting events even though they are still polling based on the data that is indexed. With polling-based approaches, edge cases that miss events on window boundaries or late-arriving data for lookups are more likely. In polling-based or micro-batching systems, different problems can arise (e.g., dealing with windowing or late arrival of data).
When building an event-driven architecture, ksqlDB shines in these areas:
With these advantages, we can see that Kafka and ksqlDB can be used to build event-driven applications that transform, enrich, and react to data as it arrives. This complements Elasticsearch, where you can send the data for long-term storage and analytics.
Typically, we would deploy Kafka in front of Elasticsearch and use Kafka Connect to push data from selected Kafka topics to Elasticsearch.
Since ksqlDB can configure Kafka connector jobs within ksqlDB itself, this greatly simplifies the process of setting up Connect jobs for sources and sinks (including Elasticsearch).
In fact, you can take data that’s stored in Elasticsearch—whether it was ingested directly or derived from queries—and get it into Kafka as well. For example, your data analysis or machine learning might provide a list of bad actors that you would like to load as a blacklist into ksqlDB for joining access logs in another topic. These might be events that have been transformed or filtered in a certain way using Elasticsearch queries, or it might take the form of curated reference data which could be used to join or enrich data in ksqlDB.
We’ve covered some of the features of ksqlDB and Elasticsearch, as well as highlighted some of the similarities and subtle differences of each when it comes to building real-time data products. With ksqlDB, we are able to enrich and filter data further upstream from Elasticsearch and send the data in a denormalized fashion for easy analysis and reporting. This can be done in real time as data comes in and not after the fact via a batch process or at query time.
This step-by-step demo shows you how to set up data streams and tables using ksqlDB to send data to Elasticsearch for analysis and visualisation in Kibana. If you’d like to get started with the Kafka Connect Elasticsearch Sink Connector, you can read Danny Kay and Liz Bennett’s blog post to learn more.
I hope this has given you a flavour of what is possible with these technologies and inspires you to create more innovative applications using these patterns.
Tableflow can seamlessly make your Kafka operational data available to your AWS analytics ecosystem with minimal effort, leveraging the capabilities of Confluent Tableflow and Amazon SageMaker Lakehouse.
Building a headless data architecture requires us to identify the work we’re already doing deep inside our data analytics plane, and shift it to the left. Learn the specifics in this blog.