Apache Flink® offers a variety of APIs that provide users with significant flexibility in processing data streams. Among these, the Table API stands out as one of the most popular options. Its user-friendly design allows developers to express complex data processing logic in a clear and declarative manner, making it particularly appealing for those who want to efficiently manipulate data without getting bogged down in intricate implementation details.
At this year’s Current, we introduced support for the Flink Table API in Confluent Cloud for Apache Flink® to enable customers to use Java and Python for their stream processing workloads. The Flink Table API is also supported in Confluent Platform for Apache Flink®, which launched in limited availability and supports all Flink APIs out of the box.
This introduction highlights the Table API's capabilities, shows how it integrates with other Flink APIs, and provides practical examples to help you get started. Whether you are working with real-time data streams or static datasets, the Table API simplifies your workflow while maintaining high performance and flexibility. If you want to go deeper into the details of how the Table API works, we encourage you to check out our Table API developer course.
The Apache Flink Table API is a unified relational API for both stream and batch processing. It offers a higher-level abstraction compared to the DataStream API, enabling developers to express complex processing logic in a declarative manner using a fluent API in Java or Python.
The Table API represents data in the form of tables, similar to those in relational databases. These tables can be created from various sources, such as Apache Kafka® topics, files, or other Flink DataStreams. The API allows you to perform operations on these tables, including filtering, joining, and aggregating data.
The Table API is one of several APIs available in the Apache Flink framework, each offering different levels of abstraction and capabilities:
At the lowest level is the ProcessFunction, which is part of the DataStream API. These are primitive building blocks capable of implementing almost any operation by directly manipulating Flink's state backends and timer services. At this level, you write code that reacts to each event as it arrives, one at a time.
The DataStream API, while including ProcessFunction, generally operates at a slightly higher level of abstraction. It provides building blocks like streams and windows, offering more structured ways to process data.
The Table API sits at an even higher level of abstraction. In terms of the abstractions involved, it is roughly equivalent to Flink SQL; however, instead of writing SQL queries, you express your data processing logic using Java or Python code.
At the highest level is Flink SQL, which provides a fully declarative interface for data processing using SQL syntax.
Our fully managed service supports Flink SQL and the Table API, while our on-premises offering supports all Flink APIs.
It's important to note that these APIs are interoperable. You're not limited to choosing just one; a single Flink application can utilize multiple APIs together, leveraging their respective strengths.
The Table API is closely integrated with Flink SQL, allowing you to seamlessly mix Table API and SQL within the same program. This flexibility enables you to use SQL for simpler operations while switching to the Table API when you need more programmatic control.
Moreover, the Table API can interoperate with the DataStream API. You can convert between Tables and DataStreams, which lets you combine the high-level operations of the Table API with the lower-level control of the DataStream API when needed.
In terms of implementation, it's worth noting that the Table API and Flink SQL are two sides of the same coin. They share the same underlying infrastructure, including the optimizer and planner that determine the operator topology. Importantly, the Table/SQL API is not built on top of the DataStream API; instead, both the DataStream API and the Table/SQL API are built on the same internal, low-level stream operator API.
This architecture allows Flink to provide a seamless experience across different levels of abstraction while maintaining high performance and flexibility.
The easiest way to get started with the Table API on Confluent Cloud is to use either the Java examples or the Python examples available on GitHub. When using Flink on Confluent Cloud, you can begin writing your business logic immediately, as all Confluent Cloud metadata is automatically available and ready to use. The Table API on Confluent Cloud works with a client-side library that delegates Table API calls to Confluent's REST API, so you don't need to create JARs or other artifacts.
The Table API is an excellent choice for serverless deployments on Confluent Cloud. It provides a cloud-native experience, allowing you to focus solely on your business logic while Confluent Cloud manages the underlying infrastructure. With automatic scaling and continuous updates, the Table API ensures that your Flink applications run efficiently and securely, without the need to manage clusters or worry about version compatibility.
To start using the Table API, you first need to create a TableEnvironment. This serves as the entry point for creating tables and executing queries. Here's a simple example in Java:
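The sketch below uses the open-source Flink Table API; on Confluent Cloud, the TableEnvironment is instead created from Confluent's client-side settings. The table contents and column names (id, amount) are illustrative:

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class TableApiExample {
    public static void main(String[] args) {
        // The TableEnvironment is the entry point for Table API programs
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // An in-memory source table standing in for, e.g., a Kafka-backed table
        Table transactions = env.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("amount", DataTypes.DOUBLE())),
                row(1, 12.5), row(2, 99.0), row(1, 3.2));

        // Declarative filtering and projection, then print the results
        transactions
                .filter($("amount").isGreater(10.0))
                .select($("id"), $("amount"))
                .execute()
                .print();
    }
}
```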
This example demonstrates how to create a source table, perform operations using the Table API, and print the results to the screen.
The Table API provides a rich set of features and operations that enable you to express complex data processing logic. While many of these features are also available in Flink SQL, the Table API presents them in a programmatic form, allowing for greater flexibility and integration with Java or Python code. Let's explore some of the key features and operations you can perform using the Table API.
One of the key advantages of the Table API is its programmatic nature. While Flink SQL offers a declarative approach using ANSI SQL, the Table API enables you to express complex data processing logic through a fluent API in Java or Python. This provides greater flexibility and control over your data processing pipeline.
Here’s the previous example rewritten in Python:
Unit testing is crucial for ensuring the correctness of your Flink Table API applications. Here's an example of how to write a unit test for the previous Table API example. In this unit test, we extract the transaction processing logic into a separate TransactionProcessor class, which can then be used in both the main TableApiExample and the TableApiExampleTest, ensuring that the same code is tested and used in production.
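One possible shape for such a test, sketched with JUnit 5 and the open-source Flink Table API (the class names follow the text; the filtering logic itself is illustrative):

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.util.CloseableIterator;
import org.junit.jupiter.api.Test;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;
import static org.junit.jupiter.api.Assertions.assertEquals;

// The logic under test, shared between the production job and this test
class TransactionProcessor {
    public Table process(Table transactions) {
        return transactions
                .filter($("amount").isGreater(10.0))
                .select($("id"), $("amount"));
    }
}

class TableApiExampleTest {
    @Test
    void filtersSmallTransactions() throws Exception {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        Table input = env.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("amount", DataTypes.DOUBLE())),
                row(1, 12.5), row(2, 3.0));

        // Collect the processed rows and count them
        int count = 0;
        try (CloseableIterator<Row> rows =
                     new TransactionProcessor().process(input).execute().collect()) {
            while (rows.hasNext()) {
                rows.next();
                count++;
            }
        }
        assertEquals(1, count);  // only the 12.5 transaction survives the filter
    }
}
```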
The Table API provides a rich set of operations that can be performed on tables, allowing you to transform and analyze your data in various ways. These operations use expressions to manipulate columns and rows. Here are some examples of common operations:
Basic column selection and renaming
Mathematical and string operations
Conditional expressions
Adding a new column based on existing ones
Combining multiple operations
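A sketch covering each of the operations above, assuming an orders table with illustrative id, name, and amount columns:

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.lit;
import static org.apache.flink.table.api.Expressions.row;

public class OperationsExample {
    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        Table orders = env.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("name", DataTypes.STRING()),
                        DataTypes.FIELD("amount", DataTypes.DOUBLE())),
                row(1, "book", 120.0), row(2, "pen", 3.5));

        // Basic column selection and renaming
        Table renamed = orders.select($("id"), $("amount").as("total"));

        // Mathematical and string operations
        Table computed = orders.select(
                $("amount").times(1.19).as("amount_with_tax"),
                $("name").upperCase().as("name_upper"));

        // Conditional expressions
        Table labeled = orders.select(
                $("id"),
                $("amount").isGreater(100.0).then(lit("large"), lit("small")).as("size"));

        // Adding a new column based on existing ones
        Table extended = orders.addColumns($("amount").times(0.1).as("tip"));

        // Combining multiple operations
        orders.filter($("amount").isGreater(10.0))
              .addColumns($("amount").times(0.1).as("tip"))
              .select($("id"), $("amount"), $("tip"))
              .execute()
              .print();
    }
}
```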
The Table API supports set operations such as union, intersect, and minus:
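A sketch of the set operations (shown in batch mode, where the duplicate-removing variants are easiest to demonstrate; the tables and values are illustrative):

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.row;

public class SetOperationsExample {
    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        Table t1 = env.fromValues(DataTypes.ROW(DataTypes.FIELD("id", DataTypes.INT())),
                row(1), row(2), row(3));
        Table t2 = env.fromValues(DataTypes.ROW(DataTypes.FIELD("id", DataTypes.INT())),
                row(2), row(3), row(4));

        Table union = t1.union(t2);            // distinct rows from both tables
        Table unionAll = t1.unionAll(t2);      // keeps duplicates
        Table intersection = t1.intersect(t2); // rows present in both tables
        Table difference = t1.minus(t2);       // rows in t1 but not in t2 (EXCEPT)

        difference.execute().print();          // prints the row with id = 1
    }
}
```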
Joins and aggregations
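A sketch of an inner join followed by a grouped aggregation, assuming illustrative orders and customers tables:

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class JoinAggExample {
    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        Table orders = env.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("customer_id", DataTypes.INT()),
                        DataTypes.FIELD("amount", DataTypes.DOUBLE())),
                row(1, 10.0), row(1, 5.0), row(2, 7.5));
        Table customers = env.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("name", DataTypes.STRING())),
                row(1, "Alice"), row(2, "Bob"));

        // Inner join on the customer id
        Table joined = orders.join(customers)
                .where($("customer_id").isEqual($("id")))
                .select($("name"), $("amount"));

        // Group by customer and sum the order amounts
        joined.groupBy($("name"))
              .select($("name"), $("amount").sum().as("total_amount"))
              .execute()
              .print();
    }
}
```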
The Table API offers powerful windowing capabilities for both batch and stream processing:
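A sketch of a tumbling event-time window (the transactions table is generated with the built-in datagen connector and bounded so the job finishes; the column names follow the description below):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.Tumble;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.lit;

public class WindowExample {
    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A demo source with an event-time column ts and a watermark
        env.executeSql(
                "CREATE TABLE transactions (" +
                "  id INT," +
                "  amount DOUBLE," +
                "  ts TIMESTAMP(3)," +
                "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'datagen', 'number-of-rows' = '100')");

        Table result = env.from("transactions")
                // 10-second tumbling windows over the event-time column ts
                .window(Tumble.over(lit(10).seconds()).on($("ts")).as("w"))
                // one aggregate per id and window
                .groupBy($("id"), $("w"))
                .select($("id"),
                        $("w").start().as("window_start"),
                        $("amount").sum().as("total_amount"));

        result.execute().print();
    }
}
```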
This example creates a tumbling window of 10 seconds based on the ts column and calculates the sum of the amount for each id within each window.
The Table API allows you to extend its functionality by defining custom logic using user-defined functions (UDFs), which are available for early access in Java for our fully managed service. Several types of UDFs are available:
Scalar functions:
These functions take scalar values as input and return a single scalar value.
Example: A function that doubles a value from the amount column.
Table functions (UDTFs):
Table functions transform scalar values into multiple rows as output. They are useful for splitting a single value into multiple rows.
Example: A function that splits a string into multiple rows, one for each word.
Aggregate functions (UDAGGs):
Aggregate functions process multiple input rows to produce a single scalar value. This is useful for operations like summing, averaging, or computing custom metrics across multiple rows.
Example: A function that calculates the cumulative sum of a column for each group of rows.
Table aggregate functions (UDTAGGs):
Table aggregate functions are similar to aggregate functions, but instead of returning a single scalar value, they return multiple rows based on the aggregation.
Example: A function that computes a running top-3 list of values for each group.
Async table functions:
These are specialized functions designed for asynchronous operations, such as performing lookups from an external system in a non-blocking manner.
Example: A function that performs an asynchronous lookup for user details from an external database.
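As a concrete sketch of the simplest case, here is the scalar function from the earlier example, a UDF that doubles the value of the amount column (the registration name DoubleIt and the table contents are illustrative):

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;
import static org.apache.flink.table.api.Expressions.row;

public class ScalarUdfExample {

    // Scalar UDF: takes one value per row and returns one value per row
    public static class DoubleIt extends ScalarFunction {
        public Double eval(Double amount) {
            return amount * 2;
        }
    }

    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        env.createTemporarySystemFunction("DoubleIt", DoubleIt.class);

        Table t = env.fromValues(
                DataTypes.ROW(DataTypes.FIELD("amount", DataTypes.DOUBLE())),
                row(10.0), row(2.5));

        // Invoke the registered function by name
        t.select(call("DoubleIt", $("amount")).as("doubled"))
         .execute()
         .print();
    }
}
```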
Now, let’s look at an example of a user-defined aggregate function (UDAGG) that tracks the total sum of all previous orders by the same ID. This function is useful for computing running totals over time for each unique ID.
Below is the UDAGG definition together with its invocation in the Table API.
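A sketch of such a function, assuming an orders table with illustrative id and amount columns (the function and accumulator names are hypothetical):

```java
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.AggregateFunction;
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.call;
import static org.apache.flink.table.api.Expressions.row;

public class RunningTotalExample {

    // Mutable accumulator holding the running sum for one id
    public static class SumAccumulator {
        public double sum = 0.0;
    }

    // UDAGG: total of all orders seen so far for a group
    public static class OrderTotal extends AggregateFunction<Double, SumAccumulator> {
        @Override
        public SumAccumulator createAccumulator() {
            return new SumAccumulator();
        }

        @Override
        public Double getValue(SumAccumulator acc) {
            return acc.sum;
        }

        // Called once per input row to update the accumulator
        public void accumulate(SumAccumulator acc, Double amount) {
            acc.sum += amount;
        }
    }

    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        env.createTemporarySystemFunction("OrderTotal", OrderTotal.class);

        Table orders = env.fromValues(
                DataTypes.ROW(
                        DataTypes.FIELD("id", DataTypes.INT()),
                        DataTypes.FIELD("amount", DataTypes.DOUBLE())),
                row(1, 10.0), row(1, 5.0), row(2, 7.5));

        // In streaming mode this emits an updated total per id as each order arrives
        orders.groupBy($("id"))
              .select($("id"), call("OrderTotal", $("amount")).as("running_total"))
              .execute()
              .print();
    }
}
```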
Like Flink SQL, the Table API supports both streaming and batch processing modes. In our fully managed Flink service, batch processing mode will be supported in the coming months. The API itself remains the same whether you're working with bounded (finite) or unbounded (infinite) data. This unified API allows you to write your data processing logic once and apply it to both batch and streaming scenarios.
In streaming mode, the Table API processes continuous streams of data in real time. It supports event time processing and sophisticated windowing operations, making it ideal for scenarios like real-time analytics or continuous ETL.
In batch mode, the Table API processes static datasets. This is suitable for scenarios where you have a finite set of data to process, such as daily or weekly report generation.
The main differences between streaming and batch mode in the Table API are:
Input processing: In streaming queries, the input is processed in real time as it arrives. The entire pipeline runs continuously to handle new data. In contrast, batch queries process finite input data in stages, running only as needed.
Result production: Streaming queries produce continuous results as new data arrives, while batch queries yield a single final result.
Time attributes: In streaming mode, you often work with event time, whereas in batch mode, time attributes are treated as regular columns.
Optimizations: The Flink optimizer may apply different optimizations depending on whether you are in streaming or batch mode.
It's important to note that bounded tables can be processed in both streaming and batch modes, while unbounded tables can only be processed in streaming mode.
To switch between streaming and batch modes, simply change the environment settings when creating your TableEnvironment:
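In the open-source API this looks like the following (on Confluent Cloud's managed service the settings object differs, and batch mode is not yet available):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ModeExample {
    public static void main(String[] args) {
        // Streaming mode: unbounded input, continuously updated results
        TableEnvironment streamEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Batch mode: bounded input, one final result
        TableEnvironment batchEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());
    }
}
```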
By offering a unified API for both streaming and batch processing, Flink's Table API provides a flexible and powerful tool for a wide range of data processing scenarios.
The Apache Flink Table API is a powerful tool for both stream and batch processing, offering a unified relational interface that simplifies complex data operations. With its higher-level abstraction compared to the DataStream API, developers can express processing logic clearly using Java or Python, allowing for greater flexibility and control across various data tasks.
By integrating seamlessly with Flink SQL and the DataStream API, the Table API supports a multifaceted approach to data management, enabling users to leverage the strengths of each API as needed. It provides essential features such as windowing capabilities, set operations, and user-defined functions, making it suitable for handling both continuous streams of data and static datasets efficiently.
To explore the Table API further, try our Table API developer course. Additionally, for a hands-on demo, check out our upcoming Q3 2024 Confluent Launch webinar, where you'll learn about the latest Flink enhancements, including Table API support, and other exciting new features for Confluent Cloud. Overall, the Flink Table API equips developers with the tools to build scalable, flexible, and efficient applications, enhancing your ability to manage data workflows effectively.