ETL stands for Extract, Transform, and Load, a set of common processes for collecting, integrating, and distributing data to make it available for additional uses, such as analytics, machine learning, or reporting. ETL enables an organization to carry out data-driven analysis and decision making using operational data.
At its core, ETL is a three-step process: (1) data is collected from one or more sources, (2) transformed to apply the required business logic, including cleaning it, re-modeling it, and joining it with other data, and (3) loaded into its new destination (such as a data warehouse or non-relational database).
(Note: A common sub-variant of ETL is ELT (Extract, Load, Transform). ETL performs the transformation step before loading the data, while ELT performs the transformation step after loading the data.)
ETL is not new. In fact, it has evolved quite a bit since the 1970s and 1980s, when the process was sequential, data was more static, systems were monolithic, and reporting was only needed on a weekly or monthly basis.
In the extract step, the focus is first to understand where useful original data is stored and what form or formats it is in. Then, processes are implemented to access it, whether through recurring nightly batch processes, in real time, or triggered by specific events or actions.
In the transform step, original data is cleaned, formats are changed, and data is aggregated so it is in the proper form to be stored in a data warehouse or other destination, where it can be used by reporting tools or other parts of the business. Example transformations include data cleansing, deduplication, format conversion, filtering, joining data from multiple sources, and aggregating data over a period of time.
In the load step, the transformed data is stored in one or more places where applications, reporting tools, and other business processes can access it, such as unstructured object stores, text files, or complex data warehouses. The process varies widely depending on the nature of the business requirements and the applications and users the data serves.
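To make the three steps concrete, here is a minimal batch ETL sketch in Java using plain JDBC. The connection URLs, table names, and the cleanup rule are hypothetical placeholders; a real pipeline would add incremental extraction, batching, and error handling.

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class SimpleBatchEtl {

    // Hypothetical source and destination connection strings.
    private static final String SOURCE_URL = "jdbc:postgresql://source-db:5432/orders";
    private static final String TARGET_URL = "jdbc:postgresql://warehouse:5432/analytics";

    record OrderRow(long id, String country, double amount) {}

    public static void main(String[] args) throws SQLException {
        // 1. Extract: read raw rows from the operational database.
        List<OrderRow> rows = new ArrayList<>();
        try (Connection src = DriverManager.getConnection(SOURCE_URL);
             Statement stmt = src.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, country, amount FROM orders")) {
            while (rs.next()) {
                rows.add(new OrderRow(rs.getLong("id"),
                        rs.getString("country"), rs.getDouble("amount")));
            }
        }

        // 2. Transform: apply business logic in the pipeline, before loading.
        //    Here: drop obviously invalid amounts and normalize country codes.
        List<OrderRow> cleaned = rows.stream()
                .filter(r -> r.amount() > 0)
                .map(r -> new OrderRow(r.id(), r.country().trim().toUpperCase(), r.amount()))
                .toList();

        // 3. Load: write the transformed rows into the warehouse table.
        try (Connection dst = DriverManager.getConnection(TARGET_URL);
             PreparedStatement insert = dst.prepareStatement(
                     "INSERT INTO fact_orders (id, country, amount) VALUES (?, ?, ?)")) {
            for (OrderRow r : cleaned) {
                insert.setLong(1, r.id());
                insert.setString(2, r.country());
                insert.setDouble(3, r.amount());
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }
}
```

Note that the transformation runs in the pipeline itself, before the data reaches the warehouse; the ELT variant discussed below moves that step into the destination.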
ETL enables data integration: an organization can gather data from multiple data sources, combine it, and then distribute it to where it's needed. This allows different types of data to work together and supports data-driven analysis and decision making using operational data.
ETL often plays a critical role in getting the right data to the right place for business intelligence and for executing broader data management strategies.
Common use cases for ETL include moving data from transactional or operational databases into a data warehouse or data lake for analysis and reporting, where it's easier or more efficient to analyze large amounts of data than would otherwise be possible in an operational database.
Other use cases include cleansing and transforming data to make it easier to work with, and integrating data from multiple sources to build a more complete view.
Reverse ETL (rETL, also known as Operational Analytics) is a process in which the traditional ETL flow is reversed: cleaned and transformed data is sent from, say, a data warehouse back into a business application. This is typically done to make analytical data available in an operational context, where it supports better data-driven decisions.
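As a minimal sketch of reverse ETL, the example below reads already-transformed analytical data from a warehouse and pushes it into a business application over HTTP. The warehouse table, the churn-score metric, and the CRM endpoint are all hypothetical illustrations, not part of any specific product.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SimpleReverseEtl {

    // Hypothetical warehouse connection and business-application endpoint.
    private static final String WAREHOUSE_URL = "jdbc:postgresql://warehouse:5432/analytics";
    private static final String CRM_ENDPOINT = "https://crm.example.com/api/customer-scores";

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Read analytical results that the regular ETL flow produced earlier...
        try (Connection dwh = DriverManager.getConnection(WAREHOUSE_URL);
             Statement stmt = dwh.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, churn_score FROM customer_churn_scores")) {

            while (rs.next()) {
                // ...and push each result back into an operational application,
                // so its users can act on the analytical data in context.
                String json = String.format(
                        "{\"customerId\": %d, \"churnScore\": %.2f}",
                        rs.getLong("customer_id"), rs.getDouble("churn_score"));

                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(CRM_ENDPOINT))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json))
                        .build();

                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("Business app responded with status " + response.statusCode());
            }
        }
    }
}
```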
ELT (Extract, Load, and Transform) is a variation of ETL where data is extracted from one or more sources, loaded into a destination database, and then transformed in place into the desired format using the destination’s processing capabilities. This process is used to quickly load and integrate data from multiple sources without the need for extensive pre-processing.
The main difference between ETL and ELT is the order in which the data transformation and loading steps are performed. ETL performs the transformation step before loading the data, while ELT performs the transformation step inside the destination after loading the data.
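To make that ordering difference concrete, the sketch below shows the ELT variant of the earlier pipeline: it assumes raw rows have already been bulk-loaded unchanged into a staging table, and the transformation then runs inside the destination using its own SQL engine. The table names and the transformation rule are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SimpleElt {

    // Hypothetical warehouse connection string.
    private static final String TARGET_URL = "jdbc:postgresql://warehouse:5432/analytics";

    public static void main(String[] args) throws SQLException {
        try (Connection dst = DriverManager.getConnection(TARGET_URL);
             Statement stmt = dst.createStatement()) {

            // Extract + Load: raw source rows are assumed to have been copied
            // as-is into a staging table (for example, via a bulk copy tool).
            // Transform: the business logic runs *inside* the destination,
            // using the warehouse's own processing capabilities.
            stmt.executeUpdate(
                "INSERT INTO fact_orders (id, country, amount) " +
                "SELECT id, UPPER(TRIM(country)), amount " +
                "FROM staging_orders " +
                "WHERE amount > 0");
        }
    }
}
```

Because the transformation is expressed against data already in the destination, ELT can load quickly and defer heavy processing to the warehouse, at the cost of storing raw, untransformed data there first.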
EtLT is a sub-variation of, or elaboration on, ELT in which data undergoes some basic transformations between the extract and load steps. However, major transformations to support business logic or data modeling are still reserved for the final transform stage.
Successfully implementing a reliable, robust, and secure multi-step process that accesses, transforms, and stores data across multiple data sources and sinks involves navigating a number of challenges.
For instance:
Permissions: Do your networks and systems have access and rights to the data?
Data freshness: Are you capturing real-time data, or stale data that's no longer of value? How ephemeral is the data, and are you able to capture it before it passes its useful lifetime?
Data quality and integrity: Do you have validation in place to notice whether extracted data is in an expected form? (A minimal validation sketch follows this list.) Combining data from multiple sources can be challenging due to differences in data formats, structures, and definitions, and it can be difficult to identify and correct errors or inconsistencies in the data. Reasoning about data lineage (i.e., what the original source was) can also become difficult once data sources are integrated.
Availability and scale: Is there enough storage and compute in your staging area to keep up with the data? (The more data that needs to be transformed, the more computationally and storage intensive it can become.)
Filtering: Which data is important data and which can be ignored or discarded?
Security and privacy: ETL processes often involve sensitive or confidential data. Organizations want to ensure that data is protected and accessed only by authorized users, with increased complexity when dealing with large volumes of data or data from multiple sources.
Data governance and compliance: Data governance policies and regulations stipulate how data provenance, privacy, and security are to be maintained, with additional complexity arising from the integration of complex data sets or of data subject to multiple disparate regulations.
Data transformation and cleansing: ETL processes often require significant data transformation and cleansing in order to prepare data for analysis or integration with other systems.
Integration with other systems: ETL processes often involve integrating data with other systems, such as data warehouses, analytics platforms, or business intelligence tools. This can be challenging, particularly when dealing with complex data structures or integration requirements.
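For the data quality and integrity point above, one lightweight guard is to validate each extracted record against the shape the pipeline expects before it enters the transform step. The field names and rules below are hypothetical; in practice this role is often played by schema validation, for example against a schema registered in a schema registry.

```java
import java.util.Map;
import java.util.Optional;

/** Minimal example of validating an extracted record before transformation. */
public class RecordValidator {

    /** Returns an error message if the record is not in the expected form. */
    public static Optional<String> validate(Map<String, Object> record) {
        if (!(record.get("id") instanceof Number)) {
            return Optional.of("missing or non-numeric 'id' field");
        }
        Object amount = record.get("amount");
        if (!(amount instanceof Number) || ((Number) amount).doubleValue() < 0) {
            return Optional.of("missing or negative 'amount' field");
        }
        if (!(record.get("country") instanceof String s) || s.isBlank()) {
            return Optional.of("missing 'country' field");
        }
        return Optional.empty(); // record looks structurally valid
    }

    public static void main(String[] args) {
        Map<String, Object> bad = Map.of("id", 42, "amount", -5.0, "country", "DE");
        validate(bad).ifPresent(err -> System.out.println("Rejected record: " + err));
    }
}
```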
Apache Kafka® is a distributed streaming platform that is the core of modern enterprise architectures. Kafka is a flexible and scalable foundation for building ETL pipelines that connect distributed systems and event-driven applications, allowing businesses to process data in real time and at scale:
Kafka connectors, which run within the Kafka Connect framework, enable developers to extract data from one system or load it into another.
The Kafka Streams API gives applications stream processing capabilities to transform data, one message or event at a time. These transformations can include joining multiple data sources, filtering data, and aggregating data over a period of time (see the example at the end of this section).
Confluent Schema Registry provides data governance with Avro schemas.
Confluent Control Center provides central data management.
ksqlDB is a database purpose-built for stream processing applications, providing a familiar, lightweight syntax to capture, process, and serve queries using only SQL.
Stream Designer provides a graphical canvas to build pipelines powered by Apache Kafka®, with SQL extensibility.
In contrast to batch ETL, Kafka enables streaming ETL pipelines that transform and distribute data in real time and at large scale.
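As a sketch of what such a streaming ETL pipeline can look like with the Kafka Streams API, the example below continuously reads events from a source topic, applies a simple filter-and-transform step, and writes the results to a sink topic. The topic names and the transformation itself are illustrative assumptions rather than a prescribed pipeline.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamingEtlExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming-etl-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Extract: continuously consume raw events from a source topic
        // (hypothetically populated by a Kafka Connect source connector).
        KStream<String, String> rawOrders = builder.stream("raw-orders");

        // Transform: filter and reshape each event as it arrives.
        KStream<String, String> cleanedOrders = rawOrders
                .filter((key, value) -> value != null && !value.isBlank())
                .mapValues(value -> value.trim().toUpperCase());

        // Load: publish the transformed events to a sink topic, from which a
        // sink connector could write them to a warehouse or other destination.
        cleanedOrders.to("clean-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

In this sketch, a Kafka Connect source connector could populate the raw-orders topic from an operational database and a sink connector could deliver clean-orders to a warehouse, so the extract, transform, and load stages all run continuously rather than in nightly batches.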