Build Streaming Apps Quickly with Flink SQL Workspaces

At this year’s Current, we introduced the public preview of our serverless Apache Flink® service, making it easier than ever to take advantage of stream processing without the complexities of infrastructure management. This first iteration of the service offers the Flink SQL API, which adheres to the ANSI standard and enables any user familiar with SQL to use Flink.

As SQL has become ubiquitous, so have SQL-based interfaces that simplify data exploration, enrichment, and analysis. In this blog post, we will provide an overview of Confluent's new SQL Workspaces for Flink, which offer the same functionality for streaming data that SQL users have come to expect for batch-based systems. We will also discuss how SQL Workspaces is fully integrated into the rest of Confluent Cloud, providing a simple and unified experience.

If you’re interested in seeing Flink and SQL Workspaces in action, be sure to register for the upcoming Flink webinar where we'll be showcasing a deep-dive demo of this feature and more.

And with that, let's start by briefly exploring how SQL has transformed the way we interact with applications and streaming data.

From batch to streaming SQL

In 1969, Edgar Frank Codd proposed the relational model, which allowed applications and data to evolve independently. IBM researchers developed his ideas in System R, whose query language eventually became SQL.

SQL expanded the potential of applications and data by optimizing data access through a declarative formalism for interrogating data, allowing for data model evolution and different sorting orders without requiring application changes, and providing answers as long as questions were formulated correctly. Today, SQL is a ubiquitous concept and elegant user interface that transforms digital information into questions, behaviors, predictions, and memories.

As data shifts from batch to real-time streaming, SQL remains critical. Confluent has approached the user experience around SQL by taking advantage of widespread SQL expertise within organizations. If you already know SQL, using Flink on Confluent Cloud will feel very natural. Here’s how Flink SQL works on Confluent Cloud.

Introducing SQL Workspaces

SQL Workspaces is the primary, browser-based option for leveraging SQL to interact with your data in Confluent Cloud. All you need to start using Workspaces is a Flink compute pool to execute your SQL statements. Once you have a compute pool, you can start writing SQL against all of your data within an entire region. If you've ever used a SQL-based UI before, your first impression of SQL Workspaces should feel familiar. Before you even begin writing SQL, you need to know which relations you have at your disposal, and what kind of data they contain. 

The navigator on your left makes this easy. It's integrated with all of your Kafka data throughout an entire cloud region. You'll notice that there are four levels of hierarchy in the navigator. The top level shows your Confluent Cloud environments. Clicking into an environment brings you to the next level: the Kafka clusters associated with that environment. Underneath each cluster are its topics, and beneath each topic is its schema, shown as a list of field names:

The navigator on the left is a hierarchical, collapsible/expandable tree representing a given catalog’s entire schema

Let's briefly dive a bit deeper, because there is a very big idea here. Confluent Cloud environments are not a Flink abstraction, so why are we showing them to you in the context of a Flink SQL editor? The answer is that we aren't. We're showing you Flink catalogs, because Flink interprets environments as catalogs. It interprets Kafka clusters as databases, and topics as tables:

Confluent Cloud | Flink
Environment | Catalog
Kafka cluster | Database
Kafka topic | Table

What this means is that Flink already understands your Confluent Cloud resources. You don't need to do anything Flink-specific to make your Kafka data accessible with SQL. As long as your topics have schemas, Flink will interpret them as tables out of the box, so writing SQL can be the very first thing you do when you arrive at the SQL Workspaces UI. A workspace can operate over data throughout an entire region, including referencing multiple Kafka clusters from a single SQL statement, which places a great deal of power at your fingertips. If you can describe something with SQL, you can do it with Flink on Confluent Cloud, all without ever leaving a workspace.
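To make the environment/catalog, cluster/database, and topic/table mapping concrete, here is a minimal sketch of exploring that hierarchy and then joining tables from two Kafka clusters in a single statement. All names (`my_env`, `cluster_a`, `cluster_b`, `orders`, `customers`, and their columns) are hypothetical and stand in for your own environment, clusters, and topics.

```sql
-- Hypothetical names: `my_env` is an environment (catalog),
-- `cluster_a` and `cluster_b` are Kafka clusters (databases),
-- `orders` and `customers` are topics with schemas (tables).

-- Walk the hierarchy: catalogs -> databases -> tables -> columns
SHOW CATALOGS;
USE CATALOG `my_env`;
SHOW DATABASES;
USE `cluster_a`;
SHOW TABLES;
DESCRIBE `orders`;

-- Reference tables in two different clusters from one statement
-- by fully qualifying them as catalog.database.table
SELECT o.order_id, o.amount, c.name
FROM `my_env`.`cluster_a`.`orders` AS o
JOIN `my_env`.`cluster_b`.`customers` AS c
  ON o.customer_id = c.customer_id;
```

Fully qualified names are only needed when a statement reaches outside the current catalog and database; after `USE CATALOG` and `USE`, a bare table name such as `orders` resolves against the current selection.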

Don't worry about saving your SQL statements either: we do that for you. Workspaces are auto-saved, so when you come back after logging out, you can always pick up where you left off. You may also want to keep a workspace around that contains common SQL statements you need to run. While workspaces are currently only visible to their creator and organization admins, you'll soon be able to share them with your co-workers for collaboration.

While SQL as a language hasn't had to fundamentally change to adapt to the world of streaming, certain semantics have. We've made a number of conscious design decisions to account for this, taking great care to strike an intuitive balance for a well-established SQL development loop that has been thrust into the context of the data streaming era.

A new era for streaming SQL

Perhaps the most obvious semantic change that streaming brings is that streaming SQL statements may never end. They may run and run, updating their results incrementally in real time as they consume their input. SQL Workspaces is natively designed around this concept. When a streaming SQL statement is run, your workspace will present the output continuously in real time. If a statement's output is being updated in place, such as for an aggregation, you'll see soft visual cues indicating which rows were recently updated.

Querying a dynamic table and sending the result of the continuous query to another dynamic table
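As a rough illustration of the two behaviors described above, the sketch below shows a continuous aggregation whose result rows are updated in place, followed by a continuous query that writes its output to another table. The table `orders`, its columns, and the target table `large_orders` are hypothetical, and the sketch assumes that a plain `CREATE TABLE` in your workspace provisions a table you are allowed to write to.

```sql
-- Hypothetical input table: orders(order_id, customer_id, amount)

-- A continuous aggregation: it never finishes on an unbounded input,
-- and each customer's row is updated in place as new orders arrive.
SELECT customer_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;

-- Send the result of a continuous query to another table.
-- Assumption: CREATE TABLE creates a writable table in the current database.
CREATE TABLE large_orders (
  order_id    STRING,
  customer_id STRING,
  amount      DOUBLE
);

INSERT INTO large_orders
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 100;
```

The second query is a simple filter so that its output is append-only; writing an updating result such as the aggregation above to a table generally requires the target table to be able to accept updates.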

Perpetual SQL statements also create interesting UX challenges around the development loop itself. For example, if I'm developing a streaming SELECT statement, and I want to quickly verify that its output is updated in real time as I manually INSERT rows into its input, how can I do that from a single SQL editor if the SELECT blocks indefinitely? I could pre-populate the input table with some rows, but that's not quite what I'm looking for. I want to be certain that it's processing the exact rows that I'm inserting, and I want to be certain that it's doing so in real time. I could open another browser window and run the SELECT and INSERT side by side, but that would be tedious.

SQL Workspaces solves this problem by allowing you to have multiple SQL editors in the same view. This pattern enables you to run a SQL statement in one editor, and write data into its input directly above or beneath it while both are visible. Having multiple editors per workspace brings many additional benefits as well: you can use a single workspace to persist multiple statements that you commonly run, you can use multiple editors as a development log as you iterate on a complex statement piece by piece, or you can simply use them to group together related statements and results that tell a story.

Multiple editors allow for easy access to frequently used statements and organized presentation of related statements and results
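The development loop described above might look like the following sketch, with each commented section standing for a separate editor in the same workspace. The table `page_views` and its columns are hypothetical, and the sketch assumes you are allowed to create and write to tables in the current database.

```sql
-- Editor 1 (run once): a hypothetical input table
CREATE TABLE page_views (
  user_id STRING,
  url     STRING
);

-- Editor 2: the streaming statement under development.
-- It keeps running and updates its counts as new rows arrive.
SELECT user_id, COUNT(*) AS views
FROM page_views
GROUP BY user_id;

-- Editor 3: write rows into the input while Editor 2 is still running,
-- then watch the counts for 'alice' and 'bob' change in Editor 2.
INSERT INTO page_views VALUES ('alice', '/home');
INSERT INTO page_views VALUES ('alice', '/pricing'), ('bob', '/home');
```

Keeping the `SELECT` and the `INSERT` statements in adjacent editors makes it easy to confirm that the running statement is processing exactly the rows you just wrote.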

Metadata such as runtime metrics are also fundamentally different in a streaming context. If you need to collect statistics about the runtime characteristics of a batch query, such as how many rows it scanned, that is typically straightforward. You just need to run the query and collect statistics afterwards. Obviously this cannot work the same way in the streaming world, for reasons we've covered. It becomes necessary to analyze runtime characteristics while a streaming statement is running. SQL Workspaces gives you this information inline, in real time. Clicking into a statement while it's running will pop out a runtime panel displaying key metrics and metadata that will help you evaluate the health and performance of a given statement in real time.

SQL Workspaces offers a runtime panel for evaluating the health and performance of a streaming statement while it's running

The SQL Workspaces UI is ultimately designed to give you everything you need to write SQL against your Kafka data, without asking you to context switch. It also takes this further by spanning core elements of the Confluent Cloud UI, offering helpful, context-based entry points designed to make your life easier. In other words, SQL Workspaces closely integrates your Flink and Kafka experiences.

Integration of SQL Workspaces with Confluent Cloud

Confluent Cloud offers an expansive, powerful set of composable tools that serve as building blocks for all of your data streaming use cases. The true beauty of Confluent Cloud is not derived specifically from Kafka, or Schema Registry, or Flink, or any individual component. Its value is derived from how seamlessly all of these components harmonize with one another, giving you the freedom to work with streaming data as you see fit and serve the needs of your organization. SQL Workspaces makes itself a part of this harmony, offering its services wherever it can be of help.

If you happen to be inspecting a Kafka topic, it may occur to you that it would be useful to run a SQL statement over that topic's data, perhaps at a specific moment. SQL Workspaces makes this possible by providing a contextual entry point embedded within the topic view. Clicking into this entry point will automatically navigate you to the Workspaces UI, and will even pre-populate your SQL editor with a valid, runnable statement against the target topic. That is, from the topic view, you are only a couple of clicks away from running a SQL statement against that topic: one click to navigate to Workspaces, and another to run the statement. The result of the statement will be the next thing you see.
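The exact statement that the entry point pre-populates is not shown here, so the following is only a guess at its general shape: a simple query over the topic's table, which you can then edit. The names `my_env`, `cluster_a`, `orders`, and the timestamp column `order_time` are hypothetical.

```sql
-- Assumed shape of a starter statement against the target topic
-- (hypothetical environment, cluster, and topic names)
SELECT *
FROM `my_env`.`cluster_a`.`orders`
LIMIT 10;

-- Narrowing the same query to a specific moment, assuming the topic's
-- schema has a timestamp column named `order_time` (hypothetical)
SELECT *
FROM `my_env`.`cluster_a`.`orders`
WHERE order_time BETWEEN TIMESTAMP '2023-10-01 09:00:00'
                     AND TIMESTAMP '2023-10-01 09:05:00';
```

From there, the statement can be edited like any other in the workspace, for example by adding filters, projections, or joins against other tables in the region.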

Similar navigational flows will also be available within Data Portal (coming soon), which itself is a highly integrated view of your entire Confluent Cloud estate. From here, you may click into topics and jump straight into a SQL Workspace to query them, or search through your actual workspaces to quickly find the one you're looking for.

From the Data Portal, you can click on a Kafka topic and open a new "SQL Workspace" by clicking "Process"

The contextual entry points embedded within the topic view and Data Portal make it easy to access and query Kafka data with just a few clicks. This streamlined and intuitive way of exploring and analyzing streaming data empowers organizations to unlock the full potential of their data.

Unlock the full potential of your data

SQL has come a long way since Codd’s specification of the relational model in 1969. It has revolutionized the way we interact with data and applications, providing a comprehensive, methodical blueprint for a model that has vastly expanded the potential of both. As the world's data continues to shift towards a real-time streaming model, SQL has kept its utility, necessitating only incremental evolution to adapt to the changing landscape.

Confluent's SQL Workspaces is a prime example of how SQL has adapted to the streaming world. It offers a browser-based option for leveraging SQL to interact with data in Confluent Cloud. With its seamless integration with other components of Confluent Cloud, SQL Workspaces allows you to write SQL against your Kafka data without having to context switch. It also provides helpful, context-based entry points designed to make your life easier.

If you're interested in learning more, don't forget to register for the upcoming Flink webinar! You'll have the opportunity to get hands-on with a technical demo that showcases the full capabilities of Flink and SQL Workspaces on Confluent Cloud.

  • Derek Nelson is a senior product manager at Confluent. He was previously the founder of PipelineDB, a database technology startup that built a SQL engine for stream processing.
