Build Predictive Machine Learning with Flink | Workshop on Dec 18 | Register Now

KSQL December Release: Streaming SQL for Apache Kafka

Written By

We are very excited to announce the December release of KSQL, the streaming SQL engine for Apache Kafka®! As we announced in the November release blog, we are releasing KSQL on a monthly basis to make it even easier for you to get up and running with the latest and greatest functionality of KSQL to solve your own business problems.

The December release, KSQL 0.3, includes both new features that have been requested by our community as well as under-the-hood improvements for better robustness and resource utilization. If you have already been using KSQL, we encourage you to upgrade to this latest version to take advantage of the new functionality and improvements.

New Features and Improvements

Avro Support and Integration with the Confluent Schema Registry

Since its initial release KSQL has already supported data in JSON and DELIMITED formats. In the past weeks we received many requests from our community to support additional data formats, and Avro has been by far the most requested. We are happy to announce that KSQL now supports data in Avro format through integration with the Confluent Schema Registry, which is part of the Confluent Platform. This means you can now run KSQL queries that read and write Avro data!

KSQL’s Avro support goes even beyond what many community members asked for. Instead of having to manually define Avro schemas, and then mapping those to KSQL’s columns and types in your DDL statements when creating a STREAM or a TABLE, KSQL does something even better: it automatically infers this information from a topic’s associated Avro schema in Confluent Schema Registry, so that you don’t have to deal with the hassle of figuring (and typing) this out yourself.

The following example shows how you can easily create a STREAM or TABLE from Kafka topics with Avro data. Note how you can completely omit the column definitions because KSQL will create the stream or table with the same columns as the fields in the corresponding Avro schema:

 

When working on Avro data, you can of course define columns and their types manually, which you may want to do if you are only interested in a subset of all the available fields in your Avro data.

Easily convert between data formats in real-time

Furthermore, you can now easily convert streams and tables (and, of course, their underlying  topics) between Avro, JSON and delimited formats by writing a single line of KSQL. This functionality is great for real-time ETL use cases.

In the example above, the second line performs the JSON-to-Avro conversion:

 

Again, note how you didn’t need to specify any Avro schema in the data conversion example. As we described in the previous section, KSQL automatically manages schemas for you. This includes registering new Avro schemas automatically with Confluent Schema Registry when needed, and KSQL will of course adhere to any configured schema compatibility settings that you have defined.

Join streams and tables across different data formats

Similarly, you can perform joins between streams and tables in KSQL regardless of the underlying data formats. There’s no special syntax needed; joining different data sources “just works” because KSQL’s internal data model translates automatically between the various data formats for you.

The following example joins a stream with Avro data and a table with JSON data:

 

Metrics and Observability

Additionally, we have taken the first steps to provide metrics and observability in KSQL. This greatly enhances the operability of KSQL, like in cases where you’re monitoring KSQL capacity or when diagnosing issues. You can now see different metrics for streams, tables, and queries for every KSQL server instance.

For streams and tables, we now have DESCRIBE EXTENDED <stream/table name> statement to show statistics, such as number of messages processed per second, total messages, the time when the last message was received, as well as corresponding failure metrics.

DESCRIBE EXTENDED example:

 

For queries, we improved the EXPLAIN <query_id> statement to show both the query execution plan and the stream application’s topology for the query along with its message processing rate, total processed messages, the time when the last message was processed, as well as failure metrics such as serialization/deserialization errors.

EXPLAIN example:

 

While the 0.3 release has laid the groundwork for observability features in KSQL, we will be adding further functionality in the upcoming releases.

Improved KSQL Server Startup

Lastly, KSQL servers start up faster now and have better resource utilization. For example, when a new server instance joins an existing pool of servers, or when a failed server recovers, it detects all terminated queries in its history and avoids starting and stopping processing topologies for those queries.

Thanks to contributors and community members!

The year 2017 is drawing to a close, and we’d like to take this opportunity to give a shout-out to all of you who have contributed to KSQL thus far, be it in the form of feature requests, bug reports, asking or answering questions, code contributions, or participating in our KSQL beta program. If you, too, are interested in joining the beta, please reach out to us!

There’s much more on the horizon for KSQL in 2018, and we’re looking forward to collaborating with you to make KSQL the easiest and best tool to process data in Kafka.

Where to go from here

If you have enjoyed this article, you might want to continue with the following resources to learn more about KSQL:

  • Hojjat is the founder and CEO of DeltaStream, a serverless database to manage, secure and process all your streams on cloud. Before starting DeltaStream, he was at Confluent where he created ksqlDB, a database purpose-built for stream processing applications from Confluent. Prior to Confluent, he worked at NEC Labs, Informatica, Quantcast and Tidemark on various big data management projects. He has a Ph.D. in computer science from UC Irvine, where he worked on scalable stream processing and publish/subscribe systems.

Did you like this blog post? Share it now