In today's competitive landscape, data warehouses and data lakes are the essential platforms for business intelligence, analytics, and AI. While immensely powerful, these systems were traditionally designed for batch data processing, often leading to insights based on data that is hours or even days old. The primary challenge has always been the complexity of bridging the gap between real-time data streams, typically flowing through Kafka, and these analytical systems. This connection has historically required building and maintaining complex, brittle ETL pipelines, a significant drain on engineering resources that inherently introduces delays. In response to this ubiquitous challenge, Confluent developed Tableflow. The goal was simple: Make it push-button easy to represent Apache Kafka® topics as open table formats such as Apache Iceberg™️ (General Availability), ready for consumption by data lakes and warehouses, without complex ETL pipelines.
Last year, we shared our vision for Tableflow to bridge the gap between real-time operational data flowing through Kafka and the analytical AI and machine learning (ML) ecosystems. Throughout our Early Access (EA) program, we received invaluable feedback from customers who experienced firsthand how Tableflow could simplify their data architectures and accelerate insights. Building on that momentum and success, we announced the general availability (GA) of Tableflow during Current Bengaluru ’25, featuring robust support for Iceberg and seamless integration with the AWS Glue Data Catalog!
Amazon Web Services (AWS) is equally excited about this expanded integration, which makes it even easier for our joint customers to unlock the value of their real-time data streams using the breadth of AWS analytics and ML services. This collaboration underscores our shared commitment to helping organizations innovate faster by simplifying data access and management in the cloud. This blog post examines the historical separation of analytical and operational systems, detailing how Tableflow's innovative features facilitate their convergence. Furthermore, it elaborates on Tableflow's seamless integration with AWS analytics services such as Amazon Athena, Amazon Redshift, Amazon EMR, and Amazon SageMaker Lakehouse, leveraging the AWS Glue Data Catalog to enable real-time analytics.
Operational and analytical systems have historically been distinct due to their differing design principles and objectives. Operational systems, encompassing microservices, software-as-a-service (SaaS) applications, and transactional databases, prioritize rapid, high-volume transaction processing to support real-time application responsiveness. Conversely, analytical systems are engineered for intricate queries, historical data exploration, and AI development, necessitating batch processing and specialized storage architectures.
Data collected in the operational estate must be shared with the analytical estate in real time to perform analytics and generate business insights. Essentially, you want to streamline the process of feeding your operational data in Kafka topics into Iceberg tables so that it's ready to power analytics in data lakes or warehouses through engines such as Amazon Athena or Amazon Redshift.
Feeding raw operational data from Kafka into Amazon S3 and other data lakes and warehouses in Iceberg format is a complex, expensive, and error-prone process that requires building custom data pipelines. In these pipelines, you need to transfer data (using sink connectors), clean data, manage schema evolutions, materialize change data capture streams, transform and compact data, and store it in Apache Parquet™️ files. Further, ongoing maintenance is required to keep these Iceberg tables clean and performant.
This intricate workflow demands significant effort and expertise to ensure data consistency and usability. What if you could eliminate all the hassle and have your Kafka topics automatically materialized into analytics-ready Iceberg tables in your data lake or warehouse? That’s precisely what Tableflow allows you to do.
Tableflow revolutionizes the way Kafka data is materialized into data lakes and warehouses by seamlessly representing Kafka topics as Iceberg tables. Tableflow uses innovations in Kora's storage layer that make it possible to take Kafka segments and write them out in other storage formats, in this case Parquet files. Tableflow also uses a new metadata publishing service behind the scenes that taps into Confluent's Schema Registry to generate Iceberg metadata transaction logs while handling schema mapping, schema evolution, and type conversions.
Here are the key capabilities of Tableflow:
Data Conversion: It converts Kafka segments and schemas in Avro, JSON, or Protobuf into Iceberg-compatible schemas and Parquet files, using Schema Registry in Confluent Cloud as the source of truth.
Schema Evolution: It automatically detects schema changes, such as adding fields or widening types, and applies them to the respective table.
Catalog Syncing: You can sync Tableflow-created tables as external tables in the AWS Glue Data Catalog, Snowflake Open Catalog, and, coming soon, Databricks Unity Catalog.
Table Maintenance and Metadata Management: It automatically compacts small files and also handles snapshot expiration and file cleanup.
Choose Your Storage: You can store the data in your own Amazon S3 bucket or let Confluent host and manage the storage for you.
With just the push of a button, you can now represent your Kafka data in Confluent Cloud as Iceberg tables to feed your data lake, warehouse, or analytical engine.
The AWS Glue Data Catalog is a centralized metadata repository for all data assets across various data sources. It acts as a comprehensive catalog for your data lake, providing a unified view and making your data discoverable, searchable, and queryable. Key features include:
Automated Schema Discovery: The catalog uses "crawlers" to automatically scan your data sources. These crawlers intelligently infer the schema, data types, and partitions of your data and then populate the catalog with this information, saving you from manual and error-prone data definition.
Seamless Integration With AWS Services: It’s natively integrated with a wide array of AWS analytics services. This allows services such as Amazon Athena (for queries), Amazon Redshift Spectrum (for data warehousing), and Amazon EMR (for big data processing) to directly use the metadata from the AWS Glue Data Catalog, streamlining your data pipelines.
Schema Versioning and Evolution: The catalog tracks changes to your data's structure over time. This is a crucial feature for data governance, as it helps prevent downstream applications and queries from failing when your data format evolves.
Improved Data Governance and Discoverability: By providing a central, searchable catalog of your data, it makes it much easier for users to find and understand the data they need. This enhances data governance and empowers users to leverage data assets more effectively.
Tableflow allows you to seamlessly discover managed Iceberg tables by integrating with the AWS Glue Data Catalog or by using its own built-in Iceberg REST catalog. As stated earlier, Tableflow's primary purpose is to materialize streaming Kafka data as Iceberg tables. These tables are written directly to cloud object storage (Amazon S3), enabling a durable and query-optimized representation of the original event stream without requiring custom pipelines or ETL processes.
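If you want to skip an external catalog entirely, the built-in Iceberg REST catalog can be read with any REST-capable Iceberg client. Here is a minimal PyIceberg sketch; the endpoint URI, API key/secret, namespace, and the "orders" table are all placeholders you would replace with the values shown on your cluster's Tableflow page in Confluent Cloud.

```python
from pyiceberg.catalog import load_catalog

# Connect to Tableflow's built-in Iceberg REST catalog.
# All values below are placeholders for this sketch.
catalog = load_catalog(
    "tableflow",
    **{
        "type": "rest",
        "uri": "<tableflow-rest-catalog-endpoint>",
        "credential": "<tableflow-api-key>:<tableflow-api-secret>",
    },
)

# Browse what Tableflow has materialized, then scan a table into Arrow.
print(catalog.list_namespaces())
table = catalog.load_table("<namespace>.orders")  # placeholder identifier
print(table.scan(limit=10).to_arrow())
```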
Now let's look at a walk-through demo of Tableflow integration with the AWS Glue Data Catalog, which takes just a few clicks. Start in your Confluent Cloud environment and enable Tableflow on the Kafka topics you want to see in AWS Glue. As part of this setup, specify your Amazon S3 bucket as the destination where Tableflow will write the data in Iceberg format. When setting up Amazon S3 access, create an AWS IAM role that Confluent can assume (AssumeRole) and establish the corresponding provider integration within Confluent Cloud. Tableflow handles the ongoing conversion of Kafka messages into Parquet files, organizes them into Iceberg tables, and manages the associated metadata and maintenance tasks such as compaction. It essentially prepares your streaming data for easy consumption by analytical engines.
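As a rough illustration of the AssumeRole piece, here is a boto3 sketch that creates a role Confluent can assume and grants it access to the destination bucket. The principal ARN, external ID, role name, and bucket name are all placeholders; in practice, the provider-integration wizard in Confluent Cloud tells you the exact principal and external ID to trust.

```python
import json
import boto3

iam = boto3.client("iam")

# Placeholders -- copy the real values from the Confluent Cloud
# provider-integration wizard.
CONFLUENT_PRINCIPAL_ARN = "arn:aws:iam::111111111111:role/confluent-provided-role"
EXTERNAL_ID = "external-id-from-confluent"
BUCKET = "my-tableflow-bucket"

# Trust policy: allow the Confluent principal to assume this role,
# scoped by the external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": CONFLUENT_PRINCIPAL_ARN},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}
iam.create_role(
    RoleName="tableflow-s3-access",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Grant the role read/write access to the destination bucket.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject",
                   "s3:ListBucket", "s3:GetBucketLocation"],
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
    }],
}
iam.put_role_policy(
    RoleName="tableflow-s3-access",
    PolicyName="tableflow-s3-write",
    PolicyDocument=json.dumps(s3_policy),
)
```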
Once Tableflow is materializing your Kafka data as Iceberg tables in an Amazon S3 bucket, the next step is to make these tables discoverable and ready to query from a query engine such as Amazon Athena. This is where the AWS Glue Data Catalog capabilities come into play. In Confluent Cloud, at the cluster level, select the Tableflow tab, add an External Catalog Integration, and select AWS Glue as the external catalog type.
Then select a provider integration for this catalog connection. The provider integration needs permissions to write to your AWS Glue Data Catalog and can be the same or different from the one you used for Tableflow to write to your Amazon S3 bucket in an earlier step.
All that's left is to review and launch the integration. With this setup, Tableflow automatically publishes Iceberg table metadata pointers to the AWS Glue Data Catalog, enabling AWS analytics services or third-party compute engines compatible with Amazon SageMaker Lakehouse to access these tables.
Once the integration is complete, you can discover Iceberg tables materialized by Tableflow in the AWS Glue Data Catalog.
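One quick way to confirm the sync is to list the registered tables with boto3. The database name below (tableflow_db) is a placeholder for whatever database the catalog integration registers in your Glue Data Catalog; Iceberg tables in Glue carry a metadata_location parameter pointing at the current metadata file in S3.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# "tableflow_db" is a placeholder database name for this sketch.
for tbl in glue.get_tables(DatabaseName="tableflow_db")["TableList"]:
    print(tbl["Name"], tbl.get("Parameters", {}).get("metadata_location"))
```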
From there, you can use your preferred AWS analytics service to consume the Iceberg tables.
You can use Amazon Athena to run SQL queries on these Iceberg tables. By configuring Athena to use the AWS Glue Data Catalog as the Iceberg catalog, you can efficiently explore and analyze the data directly from your data lake.
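For example, a query can be submitted programmatically through the Athena API. This is a minimal sketch; the database, table, and results bucket are placeholders, and Athena resolves the Iceberg metadata through the Glue Data Catalog.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Database, table, and output location are placeholders for this sketch.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "tableflow_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Started query:", resp["QueryExecutionId"])
```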
If you're working with Amazon Redshift, you can configure it to integrate with the AWS Glue Data Catalog as well. This setup allows Redshift to query the same Iceberg tables, enabling fast, scalable data processing and analysis across your datasets.
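As a sketch of what that looks like with the Redshift Data API, assuming a Redshift Serverless workgroup that exposes the Glue Data Catalog as the auto-mounted awsdatacatalog database (workgroup, database, and table names below are placeholders):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Workgroup, database, and table names are placeholders for this sketch.
resp = rsd.execute_statement(
    WorkgroupName="my-workgroup",
    Database="dev",
    Sql='SELECT * FROM "awsdatacatalog"."tableflow_db"."orders" LIMIT 10;',
)
print("Statement id:", resp["Id"])
```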
Alternatively, if you prefer Apache Spark™️ for data processing, you can use Amazon EMR to query the Iceberg tables. By configuring your EMR cluster to access the AWS Glue Data Catalog, Spark jobs can directly interact with the Iceberg tables, allowing you to perform advanced transformations and analytics on the data.
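A typical Iceberg-on-Glue Spark session looks like the following PySpark sketch. The catalog name, warehouse bucket, database, and table are placeholders; on EMR, the Iceberg runtime is available when the cluster is created with the Iceberg application enabled.

```python
from pyspark.sql import SparkSession

# Register a Spark catalog named "glue" backed by the AWS Glue Data Catalog.
# Bucket, database, and table names are placeholders for this sketch.
spark = (
    SparkSession.builder
    .appName("tableflow-analytics")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-tableflow-bucket/")
    .getOrCreate()
)

# Query the Tableflow-materialized table registered in the Glue Data Catalog.
spark.sql("SELECT * FROM glue.tableflow_db.orders LIMIT 10").show()
```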
This powerful combination of Tableflow on Confluent Cloud and AWS analytics services delivers tangible benefits for our mutual customers:
Faster Time to Insight: Reduce the delay in making streaming data available for analytics and AI from days or weeks down to minutes or hours. React faster to changing business conditions and customer behavior.
Reduced Complexity and Cost: Eliminate the need to build, manage, and scale custom ETL/ELT pipelines. Free up valuable data engineering resources and reduce associated infrastructure compute costs, as Tableflow's managed maintenance optimizes storage and performance.
Improved Data Governance and Consistency: Maintain a single source of truth originating from Kafka, ensure consistent schema application via Schema Registry, and provide unified discoverability through the AWS Glue Data Catalog.
Enhanced AI/ML Initiatives: Fuel your ML models on Amazon SageMaker and other platforms with fresher, more accurate, and readily accessible real-time data, leading to more relevant predictions and insights.
Better Return on Investment: Maximize the value extracted from your investments in both the Confluent data streaming platform and the comprehensive suite of AWS analytics services.
The general availability of Tableflow on AWS with Iceberg support and AWS Glue integration marks a significant milestone, but our journey doesn't stop here. Confluent and AWS are deeply committed to simplifying data architectures and accelerating innovation for our customers. Tableflow is core to our strategy of helping organizations effectively leverage their data in motion across the entire AWS cloud ecosystem.
Throughout the remainder of FY '25 and beyond, Confluent will continue to invest heavily in Tableflow, enhancing its capabilities, performance, and integration points based on customer feedback and evolving market needs. Our product roadmap includes the general availability of Delta Lake support, Unity Catalog integration, Upsert capabilities, Dead Letter Queue (DLQ) functionality, Apache Flink® integration, bidirectional data flow, and support for Microsoft Azure and Google Cloud Platform. These enhancements will further streamline the movement of data from operational systems to analytical systems, facilitating insightful analytics and AI-driven initiatives. Together, Confluent and AWS will continue to empower organizations to build next-generation, real-time applications and analytics on the cloud.
Ready to eliminate ETL complexity and unlock real-time analytics on AWS? Explore Tableflow today!
Learn More: Dive into the Tableflow product documentation.
See it in Action: Watch our short introduction video or Tim Berglund's lightboard explanation.
Get Started: If you're already using Confluent Cloud, navigate to the Tableflow section for your cluster. New users can sign up for Confluent Cloud on AWS Marketplace and explore Tableflow's capabilities.
Additionally, contact us today for a personalized demo and start unlocking the full potential of your data on AWS. We’re incredibly excited to see how you leverage Tableflow and AWS to turn your real-time data streams into tangible business value!
The preceding outlines our general product direction and is not a commitment to deliver any material, code, or functionality. The development, release, timing, and pricing of any features or functionality described may change. Customers should make their purchase decisions based on services, features, and functions that are currently available.
Confluent and associated marks are trademarks or registered trademarks of Confluent, Inc.
Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, Apache Spark™️, Spark™️, Apache Iceberg™️, Iceberg™️, and the associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.
Amazon and all related marks are trademarks of Amazon.com, Inc. or its affiliates.