Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now

ksqlDB: The Missing Link Between Real-Time Data and Big Data Streaming

Is event streaming or batch processing more efficient in data processing? Is an IoT system the same as a data analytics system, and a fast data system the same as a big data system? These questions stem from two philosophies of data analytics architecture: state orientation and event orientation. This post will reflect on both concepts and their evolution, as well as highlight the most recent event streaming capabilities found in ksqlDB.

State orientation vs. event orientation

In 1887, Dutch paleoanthropologist Eugène Dubois found the fossil remains of “Java Man” (Anthropopithecus erectus) in Indonesia. He soon proclaimed that he had found the missing link, the evolutionary species that served as the connection between apes and Homo sapiens. Later, new discoveries appeared such as “Lucy” (Australopithecus afarensis)—also thought to be the longed-for link at one point—but the idea of a linear evolution of the human species was eventually abandoned during the 20th century. It was replaced by the theory of a branching tree evolution, where development was believed to be diversified into different species with similar characteristics that coexisted over time.

Within data analysis technologies, also subject to continuous evolution, we can distinguish two main currents that have inspired data architecture: state orientation and event orientation.

Under the first paradigm is the classic three-layer architecture: service-oriented architecture (SOA), microservices, and traditional business intelligence (BI) systems, in which data is exposed in the form of services and BI universes or operational APIs, which are consumed by applications or analysts. In this variant, data is requested (pulled) to know the state of an element or set of elements at a given time.

Under the second paradigm is event-driven architecture (EDA), or fast data, as well as the centralized architectures of IoT and streaming analytics. In this case, the data is delivered (pushed) in the form of events that report changes in state. Jerry Mathew explains these concepts in detail.

Data analysis and architecture complexity

The technological implementation of state orientation versus event orientation varies.

In the case of state orientation, it is based on a database system that acts as a source of truth. This system’s data is exposed to applications through synchronous calls or queries via SQL or an API.

With event orientation, fundamentally the event manager or queue manager allows the applications to store and distribute events in an asynchronous way. In the IoT world, there are also local Data Distribution Service (DDS) technologies that allow for peer-to-peer networks to distribute events, which are very useful in applications that want to avoid centralized dependencies.

If the application requires state orientation and event orientation simultaneously, it is necessary to implement both in the architecture. Any modern IoT or big data platform typically incorporates elements for both event streaming and batch processing with push and pull query mechanisms, at the expense of introducing complexity into the architecture.

The first pre-hominids in databases

This section will focus on the evolution of event streaming technologies and data analytics into new intermediate architectures, which adopt elements of both event streaming and batch processing.

Similar to what some believe are the first pre-hominids like Ardipithecus ramidus, which maintained the characteristics of an ape but began to adopt the first human traits, databases are beginning to develop capabilities for reporting events and vice versa.

  • Databases with SQL push notifications are available. Platforms such as Microsoft SQL Server or Google’s Firebase allow you to send notifications of changes to clients to store a cache of the database. It is very useful for getting mobile applications to work without connectivity.
  • Database replicators are proliferating (e.g., Debezium, Qlik Replicate™, or Oracle GoldenGate). These allow changes to be propagated to another database in the backend. Its usefulness is not in providing high availability but in implementing ETL (extract, transform, load) processes, replicating the data in another destination to apply a change.
  • Event managers adopt SQL syntax. Normally, they incorporate processing engines based on programming languages such as Scala, Python, or Java. To facilitate adoption among SQL developers, they’ll also incorporate programming frameworks in SQL. In IoT, HiveMQ is a SQL manager for MQTT, a popular messaging manager for telemetry, which is already part of the OASIS standards.

ksqlDB: The missing link to event streaming databases

As we continue on this evolutionary path, Confluent, founded by the original co-creators of Apache Kafka®, introduced a new database for event streaming called ksqlDB. This technology turns the well-known streaming SQL engine into a powerful event streaming database with real-time querying functions.

Imagine, for example, an IoT application used to collect changes in the state of a vehicle’s sensors. ksqlDB allows, by means of SQL code, the creation of stream-type tables that register in an immutable way a log of all the events. In addition, it incorporates numerous connectors for automatic synchronization with very diverse sources.

But if we want to know the status of the sensors at a given time, ksqlDB can materialize the previous stream in a view or table, in which the key is the ID of each sensor. Thus, the most up-to-date sensor value appears in different columns. In this way, the application can obtain data via SQL using a classic pull query to know the status of a sensor:

SELECT instant_velocity, current_latitude, current_longitude
FROM sensor_assets
WHERE vehicleid = '19012015' AND sensorid = "01031975";

But it is also possible to launch the query to continuously subscribe to status changes by simply adding the EMIT CHANGES statement:

SELECT instant_velocity, current_latitude, current_longitude
FROM sensor_assets
WHERE vehicleid = '19012015' AND sensorid = "01031975"
EMIT CHANGES;

In this way, the application continuously receives the events of state changes, which allows for alerting or application of a machine learning model.

ksqlDB also enables high performance, scalability, high availability, fault tolerance, and low latencies. Is ksqlDB just another pre-hominid in our technological evolution, or is it the real missing link between big data and fast data systems?

Open source innovation is accelerating exponentially, and ksqlDB is an ongoing effort to simplify your architecture.

Get started with ksqlDB

Ready to check ksqlDB out? Head over to ksqldb.io to get started, where you can follow the quick start, read the docs, and learn more!

A version of this post was originally published by Guillermo Gavilán on the Empresas blog.

  • Guillermo Gavilán is a telecom engineer at Telefónica, where he serves as manager of the BI & Big Data Infrastructures and Data Governance Team. He started out 20 years ago as a developer of intelligent network services, where he acquired experience in real-time systems. His expertise is in telecom networks and IT technologies and architectures, and for the last five years, he has been focusing on big data and analytics. He is also a hobbyist musician and an enthusiastic father.

Did you like this blog post? Share it now