The Apache Flink® community unveiled Apache Flink version 1.19 this week! This release is packed with numerous new features and enhancements. In this blog post, we'll spotlight some of the standout additions. For a comprehensive rundown of all updates, don't forget to review the release notes.
The release of Apache Flink 1.19 marks another step forward in stream processing technology. It includes various features and improvements aimed at enhancing the system's reliability and flexibility, while also setting the groundwork for the upcoming Flink 2.0. This version focuses on significant Flink Improvement Proposals (FLIPs) and other contributions, demonstrating how Flink 1.19 contributes to advancing stream processing and supports the development of more dynamic, efficient, and user-focused data streaming applications.
Apache Flink has become more user-friendly and adaptable thanks to FLIP-366: Support standard YAML for FLINK configuration, which integrates standard YAML support for configurations. The default configuration file has changed from flink-conf.yaml to config.yaml. The legacy flink-conf.yaml can still be used and takes precedence when present, but it will be removed in Flink 2.0. This shift toward a more familiar and widely used syntax simplifies the configuration process, making it more intuitive and less prone to errors.
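As a sketch, a minimal config.yaml using standard, nested YAML syntax might look like the following; the keys are common Flink options, and the values are placeholders rather than recommendations:

```yaml
# config.yaml — standard YAML, so nesting and the usual YAML tooling work.
jobmanager:
  rpc:
    address: localhost
  memory:
    process:
      size: 1600m
taskmanager:
  numberOfTaskSlots: 2
parallelism:
  default: 1
```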
The move not only enhances user experience but also aligns Flink with broader industry practices, ensuring that new and existing users can more easily adapt their configurations for optimal performance.
The introduction of dynamic source parallelism inference, as proposed in FLIP-379: Dynamic source parallelism inference for batch jobs, marks a significant advancement in Flink’s ability to manage resources more effectively when executing batch jobs. Source connectors can now implement the new interfaces, as Flink’s FileSource connector already does. This feature allows Flink to automatically adjust the parallelism of source tasks based on the workload, improving resource utilization and reducing the need for manual intervention. If necessary, users can still cap the parallelism that source inference may reach. The configuration option execution.batch.adaptive.auto-parallelism.default-source-parallelism sets this limit; if it is not specified, the system falls back to the limit set by execution.batch.adaptive.auto-parallelism.max-parallelism, and if neither is set, the general parallelism settings apply. This dynamic adjustment is particularly beneficial in cloud environments, where scalability and efficient resource use are paramount. By optimizing resource allocation in response to real-time data volumes, Flink 1.19 ensures that applications can scale more gracefully, maintaining high performance even under fluctuating loads.
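For example, both options can be set in config.yaml; the values below are illustrative:

```yaml
# Upper bound used by dynamic source parallelism inference.
execution.batch.adaptive.auto-parallelism.default-source-parallelism: 4
# Fallback bound consulted when the option above is not set.
execution.batch.adaptive.auto-parallelism.max-parallelism: 32
```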
Flink 1.19 introduces several enhancements to its SQL and Table API that significantly improve its capabilities and ease of use. For instance, FLIP-367: Support Setting Parallelism for Table/SQL Sources enables users to specify custom parallelism for Table/SQL sources directly in their queries, offering a level of control that fine-tunes performance and resource management.
The following Datagen DDL statement shows how you can directly set the scan parallelism to 4 (the table and column names are illustrative):
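```sql
CREATE TABLE Orders (
    order_number BIGINT,
    price        DECIMAL(32, 2),
    order_time   TIMESTAMP(3)
) WITH (
    'connector' = 'datagen',
    -- FLIP-367: set the parallelism of this source explicitly.
    'scan.parallelism' = '4'
);
```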
You could also provide this configuration on a previously defined table by using a dynamic table option, i.e., an OPTIONS hint supplied at query time (the table name is again illustrative):
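```sql
-- Override the scan parallelism of an existing table for this query only.
SELECT * FROM Orders /*+ OPTIONS('scan.parallelism' = '4') */;
```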
Moreover, the capability to configure different state time-to-live (TTL) values using SQL hints (FLIP-373: Support Configuring Different State TTLs using SQL Hint) introduces a nuanced approach to managing stateful operations. This flexibility allows developers to specify TTL values directly within their SQL queries, making it easier to optimize state management and resource usage based on the requirements of each specific application scenario.
Take a regular streaming join, for example. To produce semantically correct results, Flink needs to keep both sides of the join input in Flink state forever. The state required to compute the query result can therefore grow without bound, depending on the number of distinct input rows of all input tables and intermediate join results. If you are willing to accept the risk of incorrect or incomplete results in order to limit the size of your state, you can now set a state TTL for the individual tables in your query.
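For instance, a streaming join can be given per-table TTLs with the STATE_TTL hint; the table names, join keys, and TTL values below are illustrative:

```sql
-- Keep Orders state for 1 day and Customers state for 20 days.
SELECT /*+ STATE_TTL('Orders' = '1d', 'Customers' = '20d') */ *
FROM Orders
LEFT OUTER JOIN Customers
    ON Orders.o_custkey = Customers.c_custkey;
```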
Flink 1.19 supports the use of SESSION Window Table-Valued Functions (TVF) in streaming mode, enabling more flexible windowing operations that group events into sessions based on activity gaps. This feature allows for intricate analyses of data patterns over time, with the ability to partition data streams by specific attributes. Here's a straightforward example to illustrate the usage (the Bid table and its columns are illustrative):
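```sql
-- Aggregate bids into per-item sessions that close after a 5-minute gap
-- in activity, partitioned by the item column.
SELECT window_start, window_end, item, SUM(price) AS total_price
FROM TABLE(
    SESSION(TABLE Bid PARTITION BY item, DESCRIPTOR(bidtime), INTERVAL '5' MINUTES))
GROUP BY window_start, window_end, item;
```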
This example showcases how users can now group and aggregate streaming data into dynamic sessions based on activity, providing deeper insights into the temporal patterns within their data.
Support for changelog inputs for Window TVF Aggregation marks a pivotal advancement in how window aggregation operators handle streams of data, particularly beneficial for changelog streams such as those originating from change data capture (CDC) data sources. This enhancement ensures seamless processing of changelog (retracting or upserting) datasets.
FLIP-400's introduction of the AsyncScalarFunction interface allows for asynchronous execution of scalar functions. Traditionally, Flink's UDFs, such as ScalarFunction, execute operations synchronously. This approach works well for computations that are CPU-bound but falls short when dealing with operations that involve waiting for external systems, such as network calls or database queries. In such cases, the synchronous execution model can significantly throttle throughput and degrade overall system performance due to the sequential processing of calls, which introduces idle time waiting for I/O operations to complete.
With AsyncScalarFunction, operations that would typically block the main processing thread while waiting for a response can instead run concurrently, allowing the main thread to proceed with other tasks. This concurrency is particularly advantageous for applications that frequently interact with slow external systems or services.
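Here is a minimal sketch of such a function. The class name is illustrative, and the external call is simulated with CompletableFuture.supplyAsync; in practice you would invoke a non-blocking client instead:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.table.functions.AsyncScalarFunction;

// Sketch of an asynchronous scalar function: eval() receives a future and
// returns immediately; the result is completed from another thread, so the
// main task thread is free to process further records in the meantime.
public class AsyncLookup extends AsyncScalarFunction {

    public void eval(CompletableFuture<String> result, String key) {
        CompletableFuture
            .supplyAsync(() -> "value-for-" + key) // stand-in for a remote call
            .whenComplete((value, error) -> {
                if (error != null) {
                    result.completeExceptionally(error);
                } else {
                    result.complete(value);
                }
            });
    }
}
```

Once registered, for example via tableEnv.createTemporarySystemFunction("ASYNC_LOOKUP", AsyncLookup.class), the function can be called from SQL like any other scalar function.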
Apache Flink 1.19 introduces significant improvements to its Sink APIs through FLIP-371: Provide initialization context for Committer creation in TwoPhaseCommittingSink and FLIP-372: Enhance and synchronize Sink API to match the Source API, aiming to give developers of connectors, such as Apache Iceberg’s, better interfaces to integrate with.
FLIP-371 addresses the inability to emit metrics from the committer that existed in previous releases of Flink’s Sink API. Committers can now report commit-related metrics such as commit duration and the size and number of committed data files. Providing a richer initialization context for committers enables developers to gain deeper insights into the data commit process, facilitating improved monitoring and performance optimization of their streaming applications.
FLIP-372 tackles the limitations of the WithPreCommitTopology interface, which originally restricted the transformation and management of committable messages, particularly in scenarios that required aggregating writer results into a single committable entity. The interface’s previous design limited data type transformations and post-aggregation adjustments, posing challenges for developers aiming for efficient data persistence. As a result, the Sink API has been restructured using mixin interfaces to improve its extensibility.
Addressing the challenge of record amplification in cascading joins, FLIP-415 introduces a new join operator that supports minibatch processing. By minimizing the generation of intermediate results, it significantly reduces the overhead associated with join operations, especially in complex data pipelines. This optimization not only boosts performance but also contributes to more efficient resource utilization.
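The new operator builds on Flink's existing minibatch settings; as a sketch, they can be enabled from the SQL client like this (the latency and size values are illustrative):

```sql
-- Enable minibatch processing so the planner can apply the minibatch join.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '5s';
SET 'table.exec.mini-batch.size' = '5000';
```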
The introduction of named parameters for functions and procedures (FLIP-387) represents another stride toward improving SQL usability and flexibility. This feature simplifies the construction of SQL queries by allowing parameters to be specified by name, reducing the likelihood of errors and making the code more readable and maintainable. All the details can be found in the Apache Flink documentation on named parameters. This enhancement is particularly beneficial in complex queries where the order of parameters can be confusing.
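For illustration, assuming a user-defined function my_udf that declares parameters param1 and param2, a call with named parameters could look like this:

```sql
-- Arguments are bound by name with =>, so their order no longer matters.
SELECT my_udf(param2 => 10, param1 => 'value1') FROM my_table;
```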
With FLIP-375: Built-in cross-platform powerful Java profiler, Flink 1.19 integrates a powerful Java profiler directly accessible from the Flink Web UI. This built-in profiler enables users to diagnose and analyze performance bottlenecks at the JobManager and TaskManager levels more conveniently. By providing detailed insights into the runtime behavior of streaming applications, this feature empowers developers and administrators to pinpoint and address performance issues more effectively, ensuring smoother and more reliable operations.
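Note that profiling is disabled by default; based on the Flink documentation, it can be switched on with a single configuration option:

```yaml
# Enable the profiler endpoints on the JobManager and TaskManager Web UIs.
rest.profiling.enabled: true
```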
Another observability enhancement is FLIP-384, which lays the foundation by introducing the TraceReporter interface. This feature addresses the previously limited visibility into Flink's checkpointing and recovery mechanisms. By enabling the reporting of traces or spans for these critical processes, FLIP-384 not only broadens the scope of monitoring capabilities but also paves the way for future enhancements in job submission, state changes, and more.
Building on this foundation, FLIP-385: Add OpenTelemetryTraceReporter and OpenTelemetryMetricReporter takes a significant step forward by integrating OpenTelemetry further into Flink. This integration provides a robust and scalable solution for tracing and metrics reporting, making it easier than ever for users to gain detailed insights into their streaming applications' behavior.
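As a sketch based on the Flink documentation, the OpenTelemetry trace reporter is configured much like a metric reporter; the factory class follows the docs, and the endpoint below is a placeholder for your own collector:

```yaml
# Report checkpointing and recovery traces to an OpenTelemetry collector.
traces.reporter.otel.factory.class: org.apache.flink.traces.otel.OpenTelemetryTraceReporterFactory
traces.reporter.otel.exporter.endpoint: http://127.0.0.1:4317
```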
Last but not least, FLIP-386: Support adding custom metrics in Recovery Spans introduces the ability to attach custom metrics to recovery spans. This addition allows for a more granular analysis of recovery operations, enabling developers and administrators to monitor specific attributes, such as data download times and ingestion rates.
Together, these FLIPs represent further advancement in Flink's observability capabilities. By offering a cohesive and powerful suite of tools for monitoring checkpointing and recovery processes, and by integrating with OpenTelemetry for advanced tracing and metrics reporting, Flink 1.19 significantly enhances the ability of developers and administrators to diagnose, analyze, and optimize their streaming applications.
Apache Flink 1.19 introduces beta support for Java 21, enabling users to compile and run the framework using the latest Java LTS version. This update facilitates access to the newest Java features and improvements, aiming to improve application performance and security while maintaining compatibility with current Java standards.
Flink 1.19 serves as a foundational step toward the release of Flink 2.0. By introducing critical deprecations and laying down the groundwork for future enhancements, this version prepares the community for the next major milestone in Flink's development. The move toward a more standardized and simplified configuration mechanism, alongside the emphasis on dynamic resource management and API usability, reflects the project's commitment to meeting the evolving needs of its user base and adapting to the ever-changing landscape of data processing technologies.
The development and refinement of Apache Flink are deeply rooted in community collaboration and open source principles. The contributions of over 160 individuals to Flink 1.19 underscore the collective effort and dedication that fuel the project's growth and innovation.
Apache Flink 1.19 is the next step in the project's pursuit of excellence in unified stream and batch processing, introducing features and improvements that address critical areas of configuration, performance, usability, and monitoring. These advancements not only enhance Flink's capabilities but also set the stage for the transformative developments anticipated in Flink 2.0.
Our appreciation goes to the contributors whose vision, expertise, and hard work have shaped this release. The Apache Flink community thrives on all contributions, and they continue to redefine the possibilities of data streaming technologies. We would like to express gratitude to all the contributors who made this release possible:
Adi Polak, Ahmed Hamdy, Akira Ajisaka, Alan Sheinberg, Aleksandr Pilipenko, Alex Wu, Alexander Fedulov, Archit Goyal, Asha Boyapati, Benchao Li, Bo Cui, Cheena Budhiraja, Chesnay Schepler, Dale Lane, Danny Cranmer, David Moravek, Dawid Wysakowicz, Deepyaman Datta, Dian Fu, Dmitriy Linevich, Elkhan Dadashov, Eric Brzezenski, Etienne Chauchot, Fang Yong, Feng Jiajie, Feng Jin, Ferenc Csaky, Gabor Somogyi, Gyula Fora, Hang Ruan, Hangxiang Yu, Hanyu Zheng, Hjw, Hong Liang Teoh, Hongshun Wang, HuangXingBo, Jack, Jacky Lau, James Hughes, Jane Chan, Jerome Gagnon, Jeyhun Karimov, Jiabao Sun, JiangXin, Jiangjie (Becket) Qin, Jim Hughes, Jing Ge, Jinzhong Li, JunRuiLee, Laffery, Leonard Xu, Lijie Wang, Martijn Visser, Marton Balassi, Matt Wang, Matthias Pohl, Matthias Schwalbe, Matyas Orhidi, Maximilian Michels, Mingliang Liu, Máté Czagány, Panagiotis Garefalakis, ParyshevSergey, Patrick Lucas, Peter Huang, Peter Vary, Piotr Nowojski, Prabhu Joseph, Pranav Sharma, Qingsheng Ren, Robin Moffatt, Roc Marshal, Rodrigo Meneses, Roman, Roman Khachatryan, Ron, Rui Fan, Ruibin Xing, Ryan Skraba, Samrat002, Sergey Nuyanzin, Shammon FY, Shengkai, Stefan Richter, SuDewei, TBCCC, Tartarus0zm, Thomas Weise, Timo Walther, Varun, Venkata krishnan Sowrirajan, Vladimir Matveev, Wang FeiFan, Weihua Hu, Weijie Guo, Wencong Liu, Xiangyu Feng, Xianxun Ye, Xiaogang Zhou, Xintong Song, XuShuai, Xuyang, Yanfei Lei, Yangze Guo, Yi Zhang, Yu Chen, Yuan Mei, Yubin Li, Yuepeng Pan, Yun Gao, Yun Tang, Yuxin Tan, Zakelly, Zhanghao Chen, Zhu Zhu, archzi, bvarghese1, caicancai, caodizhou, dongwoo6kim, duanyc, eason.qin, fengjiajie, fengli, gongzhongqiang, gyang94, hejufang, jiangxin, jiaoqingbo, jingge, lijingwei.5018, lincoln lee, liuyongvs, luoyuxia, mimaomao, murong00, polaris6, pvary, sharath1709, simplejason, sunxia, sxnan, tzy123-123, wangfeifan, wangzzu, xiangyu0xf, xiarui, xingbo, xuyang, yeming, yhx, yinhan.yh, yunfan123, yunfengzhou-hub, yunhong, yuxia Luo, yuxiang, zoudan, 周仁祥, 曹帝胄, 朱通通, 马越