Data streaming with events supports many different applications and use cases. Event-driven microservices use data streaming, allowing companies to build applications based on domain-driven designs. This approach allows teams to break applications into composable microservices that can be worked on independently, speeding development. These designs scale well and can process huge amounts of data efficiently. As a result, data streaming has become popular for building large-scale cloud applications using microservices. It’s also widely used in enterprises to share data and drive decision-making. Apache Kafka® data products can be shared easily in real time through streaming.
Kafka was the key enabling technology for data streaming. Its popularity grew along with the rise of microservices-based application development. Its asynchronous publish/subscribe architecture, scalability, and open source nature made it well suited for sharing data across enterprises via well-defined data products.
Today, Kafka is widely used for data streaming between applications, but data integration between specific Kafka producers and consumers remains a common use case. In this article, we’ll explore two expressions of data integration: Kafka’s and ETL’s.
In the original blog post that introduced Kafka back in 2013, data integration was presented as Kafka’s original use case. Data integration was defined as the ability to broadcast and share data using a centralized log. With the log, data could be shared between any source(s) and destination(s) in the enterprise estate—efficiently, at scale, and in real time. This interpretation of data integration, which we’ll refer to as streaming data integration, is different from ETL-oriented “data integration,” the most widely used and understood form of data integration today.
First, let’s consider data integration built with ETL.
Today, data integration commonly refers to a point-to-point, isolated transfer between a source and destination, with data periodically extracted from the source and loaded into the destination.
Transformations that shape the data specifically for the destination can happen between the extract and the load (ETL) or after the load (ELT). This traditional approach will be referred to as ETL data integration.
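To make the pattern concrete, here is a minimal sketch of a single point-to-point batch job, assuming a hypothetical orders table in a source database and a warehouse table shaped for one reporting use case. The connection URLs, table names, and columns are illustrative only, and the relevant JDBC drivers would need to be on the classpath.

```java
import java.sql.*;

// Deliberately minimal sketch of one point-to-point ETL job: periodically
// extract rows from a source, reshape them for the destination, and load them.
// All connection strings, tables, and columns are hypothetical.
public class OrdersEtlJob {
    public static void main(String[] args) throws SQLException {
        try (Connection source = DriverManager.getConnection("jdbc:postgresql://source-db/orders_db");
             Connection dest   = DriverManager.getConnection("jdbc:snowflake://my-account/warehouse_db")) {

            // Extract: pull only rows updated since the last batch run.
            PreparedStatement extract = source.prepareStatement(
                "SELECT id, customer_id, amount FROM orders WHERE updated_at > ?");
            extract.setTimestamp(1, lastRunTimestamp());

            // Load: write into a destination table shaped for this one use case.
            PreparedStatement load = dest.prepareStatement(
                "INSERT INTO dw_orders (order_id, customer_id, amount_usd) VALUES (?, ?, ?)");

            try (ResultSet rows = extract.executeQuery()) {
                while (rows.next()) {
                    // Transform: trivial reshaping, specific to this destination only.
                    load.setLong(1, rows.getLong("id"));
                    load.setLong(2, rows.getLong("customer_id"));
                    load.setBigDecimal(3, rows.getBigDecimal("amount"));
                    load.addBatch();
                }
            }
            load.executeBatch();
        }
    }

    private static Timestamp lastRunTimestamp() {
        // In a real job this watermark would be persisted between runs.
        return Timestamp.valueOf("1970-01-01 00:00:00");
    }
}
```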
ETL data integration is popular and has several advantages:
Simplicity via no-code tools with easy-to-use interfaces. Lots of source-destination pairs need to be created with ETL, so the process must be as simple as possible.
Wide coverage for different sources and destinations.
Well-known and repeatable use cases (e.g., implementing the Medallion architecture).
Doesn’t require users to set up, configure, and operate a centralized log like Kafka.
ETL is simpler than Kafka streaming because it maps a source directly to a single destination. It doesn’t have to generalize the source data into a reusable stream that supports different use cases, and it can use its own custom serialization format designed to handle ETL data transfers, schema mapping, and type conversions efficiently and in a generalized manner. (Note: Kafka also allows users to develop their own custom serialization formats, but it’s easier to reuse existing formats such as Avro and JSON Schema.)
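On the Kafka side, reusing an existing format usually amounts to a few lines of producer configuration. The sketch below points Confluent’s Avro serializer at a Schema Registry; the bootstrap and registry URLs are placeholders for your own environment.

```java
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch of a producer that reuses Avro plus Schema Registry instead of a
// custom wire format. URLs are placeholders for your own environment.
public class AvroProducerExample {
    public static KafkaProducer<String, GenericRecord> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        // Confluent's Avro serializer registers and validates schemas automatically.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        return new KafkaProducer<>(props);
    }
}
```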
Most ETL systems use a canonical ledger that simplifies defining the mappings for schemas and types between each potential pair of sources and destinations. This helps handle both the general case and edge cases. The ledger keeps this mapping information in a single location in the code repository so that it’s explicit and available to all code modules. Because it’s concentrated in one place, it can be tested, reasoned about, and evolved as new versions of the sources and destinations appear.
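What such a ledger looks like varies by tool. As a purely hypothetical illustration (none of these class, table, or type names come from a specific product), it might be as simple as a version-controlled mapping table:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a canonical "ledger": a single, version-controlled
// place that declares how each source type maps onto each destination type,
// so every pipeline module reads the same rules.
public final class TypeLedger {

    public record ColumnMapping(String sourceColumn, String sourceType,
                                String destColumn, String destType) {}

    // One entry per source/destination pair, kept in one file so it can be
    // reviewed, tested, and evolved alongside schema changes.
    public static final Map<String, List<ColumnMapping>> MAPPINGS = Map.of(
        "postgres.orders -> snowflake.dw_orders", List.of(
            new ColumnMapping("id",         "bigint",      "order_id",   "NUMBER"),
            new ColumnMapping("amount",     "numeric",     "amount_usd", "NUMBER(38,2)"),
            new ColumnMapping("updated_at", "timestamptz", "updated_ts", "TIMESTAMP_TZ")
        )
    );
}
```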
That’s the good news about ETL. But ETL also has drawbacks.
Traditional ETL, especially from operational databases, has limitations:
Instead of creating a canonical, shared form of the data that encourages reuse, ETL transforms the data to fit the specific, narrow use case of the destination, often a data warehouse. This discourages reuse.
Since no shared log exists to store the original extracted data, reuse is effectively impossible outside the destination. The vendors building destinations like data warehouses or lakehouses see this as a feature. For data engineers and business owners who want to get the most out of their data across the enterprise data estate, it’s a bug. All that effort and expense to liberate data from the source, and it ends up locked up again.
Each ETL connection has to be set up and managed independently. The modern data stack improved the implementation and ease of use of each connection, but the proliferation of connections remained. (Note: dbt simplified SQL transform workflows. Fivetran and other ETL vendors simplified configuration and setup. Snowflake and Databricks reinvented the data warehouse in the cloud. Nonetheless, the ETL workflow pattern in use since the data warehouse was first introduced remains the same.) By contrast, Kafka’s design allows multiple sources to feed a single destination, whereas ETL data integration requires a separate connector on the destination for each source, so the Kafka approach reduces both complexity and cost.
ETL data integration emphasizes batch processing, with the destination periodically polling the source to obtain a portion of the data. This reduces the overhead on the source (important, since multiple ETL connections may be operating at the same time) at the expense of higher latency and potentially incomplete data. (Note: These batch processes are often applied serially as data is aggregated and enriched for end consumers, so data is often hours old before reaching its final destination.) Since lakehouses and data warehouses are common destinations, extracting data again from these platforms requires a slow, reverse ETL process. Real time is, in general, not possible with ETL data integration. (Kai Waehner has blogged extensively about reverse ETL from data warehouses and lakehouses, pointing out that in most use cases it’s an anti-pattern.)
ETL leads to data that’s isolated and siloed, transformed for a specific use case in a particular vendor’s data warehouse, making it difficult to share data across an enterprise in a canonical form. Jack Vanlightly called out why that approach no longer works:
“But even that aside, if we look at the world today, it is much more complex than what ETL was originally designed for. It’s not just moving data from many relational databases to one data warehouse. Data doesn’t just live in the operational and the analytics estates; we now have SaaS representing a third data estate. Data flows across regions and clouds, from backend systems to SaaS and vice versa. There are probably 100x more applications now than there used to be. Organizations are becoming software, with ever more complex webs of relationships between software systems. ETL, ELT, and reverse ETL are looking at this problem from a silo mindset, but modern data architectures need to think in graphs.”
Vanlightly’s vision of derived datasets being expressed as graphs is precisely what streaming data integration, covered in the next section, enables. Data sets that are richly interconnected in flexible ways are what the Kafka mindset for data integration is all about.
Now let’s look at how streaming data integration with Kafka can address the shortcomings of ETL data integration.
Streaming data integration with Kafka is significantly better than ETL data integration in important ways:
Unlike ETL data integration, streaming data integration spans seamlessly from real-time data streaming to batch.
One source to many destinations and many sources to one destination are both implemented efficiently in Kafka, but not in ETL.
Streaming data integration supports enriched, reusable, canonical streams that can be transformed, shared, or replicated to different destinations, not just one.
For example, database change data capture (CDC) streams allow views to be materialized for many different uses. (Note: This still follows the original central-log design principle that the source stream not be specialized to the destination, since the CDC stream can be repurposed and materialized in many different ways.) Different views can be materialized for different destinations, whether it’s a lakehouse for analytics, real-time updates for a Redis cache, or a logical replica maintained in a different database. The stream can be transformed within Kafka or at the destination, based on its intended use, and the same CDC event streams can be reused for event-driven applications while simultaneously driving these streaming integration workloads. (A minimal Kafka Streams sketch of this pattern follows this list.)
Streaming data integration promotes and leverages reusable data products to meet different requirements.
Streaming data integration unifies real-time and historical data analysis. Data streaming with real-time data is complemented by open table storage for historical data analysis.
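As a minimal sketch of the CDC pattern mentioned above, the Kafka Streams topology below reads one change stream and materializes it two different ways. The topic names are assumptions, and the change events are treated as plain strings for brevity; in practice they would typically be Avro or JSON records from a connector such as Debezium, with the appropriate serdes set in the streams configuration.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Sketch: one CDC topic ("orders.cdc", name assumed) feeding two different
// materializations without re-extracting anything from the source database.
// Default String serdes are assumed in the streams configuration.
public class CdcViewsTopology {
    public static void build(StreamsBuilder builder) {
        // The raw change stream, kept general-purpose (not shaped for any one sink).
        KStream<String, String> changes = builder.stream("orders.cdc");

        // View 1: latest state per key, usable as a queryable, replica-style view.
        KTable<String, String> latestOrderState = changes.toTable();

        // View 2: a filtered projection routed to its own topic, which a cache-
        // or search-index sink connector can consume independently.
        changes
            .filter((orderId, payload) -> payload != null)  // drop tombstones for this view
            .to("orders.for-cache");
    }
}
```

The same underlying topic keeps serving event-driven applications while these derived views are maintained, which is the reuse that point-to-point ETL gives up.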
As much as possible, streaming data integration should adopt the best practices of modern ETL data integration, including the streamlined workflows used to set up each ETL process, since they complement streaming data integration’s advantages. These best practices include:
Easy-to-use, intuitive, point-and-click workflows
Reduced toil on repetitive tasks (config management, passwords, debugging)
A single pane of glass supported by a metadata catalog of sources, removing the need to log into a database or software-as-a-service (SaaS) app to see the available data
Making common tasks, such as remapping one table to another, easy to do
Streaming data integration is about building shareable data products that are richly interconnected in flexible ways. It’s a common pattern in data streaming, and it’s more flexible, efficient, and powerful than traditional ETL data integration. Keeping your data in motion via streaming data integration lets you move your data where you need it, when you need it.
The streaming data integration vision described in the original Kafka blog post lets you share data wherever it’s needed. For example, if you want to run combined analytics across your Postgres order and payment system and an external clickstream that’s already in Kafka, you can stream that clickstream data into Postgres.
Or if your developers prefer Databricks for that analysis (and if they’re Databricks developers, they will), you can stream the clickstream data to Databricks, create data products from the Postgres orders and payments systems, and stream those to Databricks as well.
And yes, that’s work, but now those data products live in Kafka, ready to be shared for other use cases. You could, for example, stream the clickstream and the Postgres order and payment data to Snowflake if your developers or analysts prefer it for a different analytics use case.
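In practice, adding one of these destinations often comes down to registering a sink connector against a topic that already exists. As a sketch, the snippet below POSTs a JDBC sink configuration for the hypothetical clickstream topic to the Kafka Connect REST API; the host names, credentials, and exact connector settings are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: adding one more destination for an existing topic by registering a
// sink connector through the Kafka Connect REST API. Host names, topic names,
// and connection details are placeholders; the connector class shown is the
// Confluent JDBC sink.
public class AddPostgresSink {
    public static void main(String[] args) throws Exception {
        String connectorConfig = """
            {
              "name": "clickstream-to-postgres",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
                "topics": "clickstream",
                "connection.url": "jdbc:postgresql://analytics-db:5432/analytics",
                "auto.create": "true",
                "insert.mode": "upsert",
                "pk.mode": "record_key"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://connect:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```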
You can adjust the latency knob with tools like Tableflow that integrate streaming with Apache Iceberg™️. If latency isn’t critical, pull your data from Iceberg. If latency is critical, pull it from Kafka. It's your choice.
But you don’t have to switch to or support different infrastructure to do this, and you don’t have to create a bunch of new point-to-point ETL connections. That proliferation of connections is exactly why point-to-point data integration is so frustrating and limiting to everyone, from users to enterprise and data architects.
It’s this use-case-by-use-case approach that drove ETL data integration companies to focus so hard on ease of use per point-to-point connection. That’s how you win in that paradigm.
At Confluent, we know that approach is broken and ultimately limiting, and that’s why we’ve been reluctant to engage with what the industry defines as "data integration" today. As currently defined, it's a product category that’s at odds with our vision for keeping data shared and in motion. Streaming data integration is a better way forward.
To recap: streaming data integration supports enriched, reusable, canonical streams that can be transformed, shared, or replicated to many destinations, not just one. Kafka’s broadcast-oriented protocol lets additional consumers be added seamlessly, with no inherent limits on bandwidth or the number of consumers. And unlike ETL data integration, streaming data integration spans seamlessly from real-time streaming to batch, unifying real-time and historical data analysis: Kafka has emphasized real-time streaming since its inception, so it’s built into the foundation of the protocol, and real-time streams are complemented by open table storage for historical analysis.
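As a closing illustration of that broadcast property, here is a minimal sketch of a brand-new consumer group attaching to an existing topic. Nothing about the producers or the existing consumers has to change; the topic name, group id, and bootstrap address below are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: a new consumer group reading an existing topic for a new use case,
// starting from the beginning of the retained log.
public class NewUseCaseConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "fraud-scoring");             // a new, independent group
        props.put("auto.offset.reset", "earliest");         // replay retained history
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));           // topic name assumed
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```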
Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, Apache Iceberg™️, and Iceberg™️ are either registered trademarks or trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.