
Solving ETL Challenges by Integrating Apache Kafka® and Zero ETL via Confluent Tableflow


Operational and analytical estates have been separated since data warehouses were first introduced in the 1990s. The operational estate includes microservices, software-as-a-service (SaaS) apps, and enterprise resource planning systems (ERPs) that have become the beating heart of an organization. The analytical estate consists of the data warehouses, lakehouses, artificial intelligence (AI)/machine learning (ML) platforms, and other custom batch workloads that support business analysis and reporting.

Organizations have traditionally managed these estates separately, connecting them with ETL/ELT pipelines. History has shown this approach to be both fragile and a drain on data teams, which would rather spend their time extracting insights from the data than stewarding its movement from one plane (operational or analytical) to the other.

How Does Kafka Compare to Traditional Point-to-Point ETL?

Apache Kafka® has come to play an important role as the bridge between the two sides of an organization. Many believe that Kafka’s core abstraction, the log, is perhaps the most fundamental building block in data systems.

The bridge between analytical and operational estates can be built on the log as a foundational building block. We can accomplish this by capturing changes made to operational databases and applications and publishing them as a sequence of changes to a log, often referred to as the change data capture (CDC) or redo log.

Microservices communicate ordered events over logs, and CDC data in logs can be rolled up into tables. This is why Kafka has become so important as a building block in connecting systems of record. It has become the bridge between the operational and analytical estates.
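To make the log-to-table relationship concrete, here is a minimal Python sketch using the confluent-kafka client that rolls a CDC topic up into an in-memory table keyed by primary key. The broker address, topic name, and event shape (loosely Debezium-style) are hypothetical.

```python
# Minimal sketch: roll a CDC topic up into an in-memory table keyed by primary key.
# Broker address, topic name, and event shape are hypothetical placeholders.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption: a local broker
    "group.id": "cdc-rollup-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customers.cdc"])        # hypothetical CDC topic

table = {}  # latest row per primary key: the log "rolled up" into a table

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        change = json.loads(msg.value())     # e.g., {"op": "u", "id": 42, "after": {...}}
        if change["op"] == "d":              # a delete removes the row
            table.pop(change["id"], None)
        else:                                # an insert/update upserts the latest row image
            table[change["id"]] = change["after"]
finally:
    consumer.close()
```

Replaying the topic from the beginning rebuilds the table; reading from the tail keeps it continuously up to date.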

The figure below shows that Kafka provides a shared, centralized, and scalable gateway for data. It provides networking, storage, and compute to scale up the log abstraction, supporting thousands of logs in-flight simultaneously.

ETL, in contrast, is point-to-point and operates at the edge, executing directly on data producers or consumers. ETL requires more connections and direct mappings from each producer to each participating consumer. Many ETL implementations cannot achieve real-time, complete, and continuous capture of CDC or other data and fall back to periodic batch sampling.

Kafka provides a shared, centralized gateway for data. ETL is point-to-point.

Zero ETL is another alternative for unifying the operational and analytical estates. Some might argue that Tableflow—how Confluent materializes Kafka topics in open table formats—is just a variation on the trend toward zero-ETL integrations between databases, data warehouses, and lakehouses and the data consumers they share data with.

Learn more about Tableflow.

What Is Zero ETL?

In principle, zero ETL allows a client to read tables in a data source and to query the data in its original form. Amazon Web Services (AWS) introduced the term during its 2022 re:Invent conference, highlighting the integration of Amazon Aurora with Amazon Redshift. Later, Salesforce and Snowflake enabled zero ETL for bidirectional data sharing.

Zero ETL can be implemented via techniques like database replication, query federation, or data streaming, but the implementation is invisible to the user. The zero-ETL user merely configures the source tables they want to access in the source database, warehouse, or lakehouse.

Some implementations copy the data from the source before it's queried or read by the consumer, so that query processing doesn't interfere with or add overhead to other work on the source database. Others forward client queries to the source and return the results to the consumer. This can degrade performance for other work on the source database, but it avoids the network and storage costs of making a complete table copy and keeping it in sync.

As seen in the figure below, Kafka is many-to-many, whereas zero ETL is one-to-many. A single zero-ETL source server provides access for many consumers. Kafka producers create topics that can be reused and accessed by any consumer. Unlike zero ETL, Kafka doesn't create point-to-point connections between producers and consumers; instead, it routes everything through its shared infrastructure.

Kafka is many-to-many. Zero ETL is one-to-many.

The name “zero ETL” implies that not having any ETL is a good thing. Proponents claim it allows you to achieve the following:

  • Query and analyze data directly in the source’s format.

  • Potentially minimize latency and intermediate steps.

  • Remove the complexity of maintaining traditional ELT and ETL pipelines.

Zero ETL provides a thin operational layer that hides the underlying data replication or query-forwarding implementation details from the user. It can work well when the user needs to logically connect a single data source to a few consumers.

Zero ETL requires minimal infrastructure compared to Kafka. It's a framework that connects the technical components involved (e.g., a data source, network, and client) both logically and physically, which simplifies implementation and configuration. The underlying transfer mechanism stays hidden while the user specifies the source, the target, and the mappings between source and target tables and business objects.

Delta Sharing – Databricks’ Approach to Zero ETL

Delta Sharing is Databricks’ popular approach to zero ETL. It’s an example of an open protocol that supports both Delta Lake and Apache Parquet™ through a table-format-agnostic data-sharing server. Unity Catalog combined with Delta Sharing enables centralized governance, auditing, and access control for shared Delta tables, ensuring secure data sharing across organizations. Additional benefits include the following (a consumer-side sketch follows the list):

  • Cross-platform data sharing from Delta Lake and Parquet

  • Sharing of live data with no replication

  • A marketplace for data products
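To show what the consumer side of that protocol looks like in practice, here is a minimal sketch using the open source delta-sharing Python client. The profile file and the share, schema, and table names are placeholders.

```python
# Minimal sketch: reading a shared table over the Delta Sharing open protocol.
# The profile file and share/schema/table names are hypothetical.
import delta_sharing

profile = "config.share"  # credentials file issued by the data provider

# Discover what the provider has shared with us.
client = delta_sharing.SharingClient(profile)
for t in client.list_all_tables():
    print(t.share, t.schema, t.name)

# Load one shared table directly into a pandas DataFrame; no copy pipeline to maintain.
df = delta_sharing.load_as_pandas(f"{profile}#sales.analytics.orders")
print(df.head())
```

The key point is that the consumer never manages a pipeline: the provider shares the table, and the client reads live data through the protocol.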

What’s the Difference Between Tableflow and Zero ETL?

Like ETL, pure Kafka implementations rely on the consumer to reconstruct and materialize table data from the CDC log for use by SQL-driven analytic engines. What if we could integrate the table materialization process with Kafka directly so that consumers could share materialized topics as tables?

Since Kafka logs are so frequently used by consumers to materialize tables, Confluent developed a direct integration between logs and tables called Tableflow. It aims to take that bridge one step further by being both simpler and tailor-made for developers and data practitioners in the analytical and operational planes alike.

Since logs can also be represented as traditional database, warehouse, or lakehouse tables, combining logs and tables in the same system helps reunify the analytical and operational estates, simplifying operations and enriching applications with analytics. It also allows applications and analytic tools to share data via unified schemas and namespaces without forcing the user to manage complicated data pipelines.

Tableflow leverages popular open table formats (i.e., Apache Iceberg™ or Delta Lake) to unlock access to Kafka logs or topics. Sometimes referred to as “headless” data storage, open table formats have become increasingly popular.

By separating the compute engine from its storage, different compute engines can leverage the same tables, simplifying operations and reducing redundancy. Iceberg has a growing ecosystem of tools and compute engines, including Apache Spark®, Apache Flink®, Snowflake, Amazon Athena, Dremio, Trino, Apache Druid®, Amazon S3, and many others.
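Because Tableflow surfaces topics as Iceberg tables behind a catalog, any Iceberg-aware engine can read them. The sketch below uses PyIceberg against an Iceberg REST catalog; the endpoint, credentials, and table identifier are hypothetical placeholders rather than Tableflow's actual configuration values.

```python
# Minimal sketch: one of many engines reading the same Iceberg table through a REST catalog.
# The catalog endpoint, credentials, and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "tableflow",
    **{
        "type": "rest",
        "uri": "https://example-catalog.endpoint/iceberg",  # assumption: an Iceberg REST catalog
        "credential": "<client-id>:<client-secret>",
        "warehouse": "my-warehouse",
    },
)

table = catalog.load_table("orders_db.orders")   # hypothetical namespace.table
df = table.scan(limit=10).to_pandas()            # any Iceberg-aware engine could do the same
print(df)
```

Spark, Flink, Trino, or Snowflake could point at the same catalog and read the same table, with no per-engine copy.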

Zero ETL Limitations Solved When Combined With Tableflow

We’ve described some of the benefits of zero ETL. Let’s now explore how Tableflow differs from zero ETL and where it is more powerful.

Limited Reusability With Zero ETL Alone vs With Tableflow

Tableflow’s data streaming frontend makes topics (and therefore tables) reusable and materializable to other consumers. Using Kafka results in a pool of topics with schemas that can be streamed to data consumers or materialized as tables via Tableflow. Stream processing with Flink or Kafka Streams can enable transformations to existing topic schemas to create new schemas that meet a consumer’s data contract specification.
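As a hedged, simplified stand-in for what a Flink or Kafka Streams job would do, the sketch below consumes an existing topic, reshapes each record to a hypothetical consumer contract, and publishes the result to a new topic that Tableflow could then materialize. Topic names and field mappings are placeholders.

```python
# Simplified analogue of a stream-processing transform (the role Flink or Kafka Streams
# would play), reshaping records from an existing topic into a new topic that matches a
# consumer's data contract. Topic names and field mappings are hypothetical.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-transform-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.raw"])            # producer-owned schema
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        src = json.loads(msg.value())
        # Project and rename fields to what the consumer's contract expects.
        out = {
            "order_id": src["id"],
            "amount_usd": round(src["amount_cents"] / 100, 2),
            "customer": src["customer_id"],
        }
        producer.produce("orders.v2", key=str(out["order_id"]), value=json.dumps(out))
        producer.poll(0)                      # serve delivery callbacks
finally:
    producer.flush()
    consumer.close()
```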

With Tableflow, you get the best of both worlds: low-latency access to standardized, schematized topics that can be reused plus table-oriented access to the same data when you need it. Since topics are logs, the history of changes is preserved by Kafka and can be reproduced in Tableflow when necessary.

With Tableflow, Kafka topics are automatically materialized into analytics-ready Iceberg or Delta Lake tables.

Tight vs Loose Coupling Between Producer and Consumer

Although zero ETL holds the promise of simplifying or removing complex data pipelines, it tightly couples the producer and consumer schemas. This means any changes to the producer schema will potentially break consumer applications downstream.

There is no clear way to decouple producer schemas from consumer or client views when using zero ETL. Consumers need to tightly synchronize with any changes producers make, or producers need to create new tables or views that match what the consumer requires. Data contracts have become popular as a mechanism for coordinating these changes and could be a powerful way to manage this tight coupling.

By requiring users to stream data into topics before materializing these topics as Iceberg tables, Tableflow provides some decoupling of data producers from consumers.

This gives users a choice. They can directly materialize topics into Iceberg or Delta Lake via Tableflow, or they can leverage Flink or Kafka Streams to transform the data and schema as required by the consumer and the associated data contract before materializing them via Tableflow.

Complexity and Costs With Zero ETL

Zero ETL’s biggest drawback is related to one of the original reasons that Jay Kreps and others at LinkedIn developed Kafka in the first place. The following figure from the original Kafka blog post shows what zero ETL still requires under the covers: a quadratically increasing number of connections between data producers and consumers.

How tight coupling between producers (M) and consumers (N) results in MxN connections in enterprise systems.

To achieve full connectivity between M data producers and N consumers, zero ETL requires MxN connections to be created and managed. In contrast, Kafka requires only M+N connections. Kafka connections use a common operating model embodied in a purpose-built distributed system.
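A quick back-of-the-envelope comparison shows how fast the gap grows as producers and consumers are added (counts are illustrative):

```python
# Integration fan-out: point-to-point (zero ETL style) vs. a shared log.
for m, n in [(5, 5), (20, 15), (100, 50)]:
    point_to_point = m * n   # every producer wired to every consumer
    shared_log = m + n       # every party connects once to the shared cluster
    print(f"M={m:>3}, N={n:>3}: MxN={point_to_point:>5} connections vs M+N={shared_log:>4}")
```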

Users need to design zero-ETL infrastructure for every source. Delta Sharing is a step in the right direction, but its support for heterogeneous sources is somewhat weak. However, by providing direct, parallel access to the underlying S3 object storage, Delta Sharing clients can gain scalable access to shared data.

Consumer Scalability Challenges With Zero ETL

Unlike Kafka, some zero-ETL implementations can’t scale on the consumer side because they put more and more burden on the sole data source as more consumers or clients are added. In contrast, Kafka clusters can add more brokers and leverage follower fetching to provide additional consumer-side streaming bandwidth to support more consumers, independent of the data producers.
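For example, with the confluent-kafka client, a consumer opts into follower fetching (KIP-392) simply by declaring its rack; the broker address, topic, and rack ID below are hypothetical, and the brokers must be configured with a rack-aware replica selector.

```python
# Minimal sketch: a consumer opting into follower fetching by declaring its rack,
# so reads can be served by a nearby replica instead of the partition leader.
# Broker address, rack ID, and topic are hypothetical placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker.example.com:9092",
    "group.id": "analytics-readers",
    "client.rack": "us-east-1a",      # match the consumer's availability zone / rack
    "auto.offset.reset": "latest",
})
consumer.subscribe(["orders.v2"])     # reads are transparently routed to a nearby replica
```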

The zero-ETL Delta Sharing protocol provides scalability to consumers by allowing direct access to the underlying object storage, avoiding a bottleneck at the Delta Sharing server.

Ironically, in some cases, like Salesforce’s zero-ETL integrations with Snowflake and other targets, Kafka is used to scale the infrastructure. But topics are hidden, data reuse is not possible, and history is not maintained, as mentioned earlier.

Combining Tableflow and Delta Sharing to Get the Benefits of Zero ETL Plus the Power of Apache Kafka®

So far, we’ve discussed zero ETL and Kafka with Tableflow as competing alternatives. But if we compare their advantages, we discover that they can be quite complementary.

Zero ETL works for small numbers of consumers that want to directly consume table data in its existing format from another table-oriented platform. Tableflow can materialize large numbers of topics into tables that can then be easily shared by Delta Sharing or other zero-ETL protocols such as Snowflake’s Secure Data Sharing.

Integration of Tableflow and zero ETL via Delta Sharing.

Tableflow retains Kafka’s scalability, topic reusability, and schema transformation capabilities, plus it transfers data in real time and avoids the server overutilization that can plague the zero-ETL servers used by many consumers.

Since zero ETL and Tableflow are offered as fully managed SaaS, users can avoid the complexity of deploying these technologies themselves and focus on the more important problem of creating business value for their internal users and customers.

Ready to solve ETL complexity and unlock real-time analytics?


Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, Apache Iceberg™, Apache Parquet, Apache Spark, Apache Druid, Iceberg™, and the Iceberg logo are either trademarks or registered trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.

  • Matthew O’Keefe is a Principal Technologist at Confluent focusing on data modeling and shifting left. His background includes stints as a college teacher, an open source tech company founder, and an Oracle Database VP.
