
What is ETL?

ETL stands for Extract, Transform and Load, a set of common processes for collecting, integrating, and distributing data to make it available for additional purposes, such as analytics, machine learning, reporting, or other business purposes. ETL enables an organization to carry out data-driven analysis and decision making using operational data.

Built by the original creators of Apache Kafka, Confluent powers scalable, reliable, real-time streaming ETL across 120+ data sources.


The ETL Process Explained

At its core, ETL is a three-step process: (1) data is collected from one or more sources, (2) transformed to apply the required business logic, including cleaning it, remodeling it, and joining it with other data, and (3) loaded into its new destination (such as a data warehouse or non-relational database).

(Note: A common subvariant of ETL is ELT (Extract, Load, Transform). ETL performs the transformation step before loading the data, while ELT performs the transformation step after loading the data.)

ETL is not new. In fact, it has evolved quite a bit since the 1970s and 1980s, when the process was sequential, data was more static, systems were monolithic, and reporting was only needed on a weekly or monthly basis.

Step 1: Extract Data

In this step, the focus is first on understanding where the useful original data is stored and what form or formats it's in. Then, processes are implemented to access it, perhaps through recurring nightly batch jobs, in real time, or triggered by specific events or actions.
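As a rough illustration, here is a minimal extraction sketch in Python. It assumes two hypothetical sources, a local SQLite database (orders.db) and a CSV export (customers.csv); real pipelines would typically point at production databases, APIs, or file drops, but the pattern of pulling raw records into a common in-memory form is the same.

```python
# Minimal extraction sketch: pull raw records from two hypothetical sources.
# "orders.db" and "customers.csv" are placeholders for real systems.
import csv
import sqlite3


def extract_orders(db_path: str = "orders.db") -> list[dict]:
    """Read raw order rows from an operational database."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT order_id, customer_id, amount, created_at FROM orders"
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]


def extract_customers(csv_path: str = "customers.csv") -> list[dict]:
    """Read customer records from a flat-file export."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    orders = extract_orders()
    customers = extract_customers()
    print(f"extracted {len(orders)} orders and {len(customers)} customers")
```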

Step 2: Transform Data

In this step, the original data is cleaned, formats are changed, and data is aggregated so it's in the proper form to be stored into a data warehouse or other destinations, where it can be used by reporting tools or other parts of the business. Example transformations include the following (a short code sketch follows the list):

  • Deriving calculated values based on the raw data
  • Re-ordering or transposing the data
  • Adding metadata or associating key-value pairs with the data
  • Removing repetitive data or adding counts of occurrences of data
  • Encoding or decoding the data to match destination requirements
  • Validating the data
  • Performing search and replace functions on the data
  • Changing the schema of the data, for example from text to values or IDs
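The Python sketch below illustrates a few of these operations on the raw orders extracted earlier: validating records, removing duplicates, deriving a calculated value, mapping a text status to a numeric ID, and attaching load metadata. The field names, status codes, and the 8% tax rate are hypothetical.

```python
# Minimal transformation sketch over raw order records (hypothetical fields).
from datetime import datetime, timezone

STATUS_IDS = {"pending": 1, "shipped": 2, "delivered": 3}


def transform(raw_orders: list[dict]) -> list[dict]:
    seen = set()
    out = []
    for row in raw_orders:
        # Validate: skip rows missing required fields or with bad amounts.
        if not row.get("order_id") or float(row.get("amount", 0)) <= 0:
            continue
        # Remove repetitive data: drop duplicate order IDs.
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        out.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            # Derive a calculated value from the raw data (illustrative 8% tax).
            "amount_with_tax": round(float(row["amount"]) * 1.08, 2),
            # Change the schema: text status -> numeric ID.
            "status_id": STATUS_IDS.get(row.get("status", "pending"), 0),
            # Add metadata describing the ETL run itself.
            "etl_loaded_at": datetime.now(timezone.utc).isoformat(),
        })
    return out
```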

Step 3: Load Data into Storage

In this step, the transformed data is stored in one or more places where applications, reporting tools, and other business processes can access it, such as in unstructured object stores, text files, or complex data warehouses. The process varies widely depending on the nature of the business requirements and the applications and users the data serves.
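Continuing the sketch from the previous steps, the snippet below loads the transformed rows into a destination table. A local SQLite database stands in for a data warehouse; the same upsert pattern applies to a warehouse bulk load or an object-store write.

```python
# Minimal load sketch: write transformed rows into a destination table.
import sqlite3


def load(rows: list[dict], db_path: str = "warehouse.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS fact_orders (
            order_id TEXT PRIMARY KEY,
            customer_id TEXT,
            amount_with_tax REAL,
            status_id INTEGER,
            etl_loaded_at TEXT
        )
    """)
    conn.executemany(
        """INSERT OR REPLACE INTO fact_orders
           (order_id, customer_id, amount_with_tax, status_id, etl_loaded_at)
           VALUES (:order_id, :customer_id, :amount_with_tax, :status_id, :etl_loaded_at)""",
        rows,
    )
    conn.commit()
    conn.close()
```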

What are the benefits of ETL?

ETL enables data integration: an organization can gather data from multiple data sources, combine it, and then distribute it to where it's needed. This allows different types of data to work together and supports data-driven analysis and decision making using operational data.

ETL often plays a critical role in getting the right data to the right place for business intelligence and executing broader data management strategies.

Common use cases for ETL include moving data from transactional or operational databases into a data warehouse or data lake for analysis and reporting, where it's easier or more efficient to analyze large amounts of data than it would be in an operational database.

Other use cases for ETL include data cleansing and transformation to make it easier to work with, and data integration to combine data from multiple sources for a more complete view of the data.

Types of ETL

Reverse ETL

Reverse ETL (rETL, also known as Operational Analytics) is a process in which the traditional ETL flow is reversed: cleaned and transformed data is sent from, say, a data warehouse into a business application. This is typically done in order to access analytical data in an operational context, to make better data-driven decisions.
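As a rough sketch of the pattern, and continuing the earlier warehouse example, the snippet below reads an aggregate from the fact_orders table and pushes it to a hypothetical CRM endpoint over HTTP. The endpoint URL and payload shape are assumptions; production reverse ETL tools add authentication, batching, retries, and change tracking.

```python
# Minimal reverse ETL sketch: warehouse aggregate -> business application.
import json
import sqlite3
import urllib.request


def sync_customer_scores(db_path: str = "warehouse.db",
                         crm_url: str = "https://crm.example.com/api/customers") -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT customer_id, SUM(amount_with_tax) AS lifetime_value "
        "FROM fact_orders GROUP BY customer_id"
    ).fetchall()
    conn.close()
    for customer_id, lifetime_value in rows:
        payload = json.dumps({"customer_id": customer_id,
                              "lifetime_value": lifetime_value}).encode()
        req = urllib.request.Request(crm_url, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # POST the analytical value back into the app
```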

ELT vs ETL

ELT (Extract, Load, and Transform) is a variation of ETL where data is extracted from one or more sources, loaded into a destination database, and then transformed in place into the desired format using the destination’s processing capabilities. This process is used to quickly load and integrate data from multiple sources without the need for extensive pre-processing.

The main difference between ETL and ELT is the order in which the data transformation and loading steps are performed. ETL performs the transformation step before loading the data, while ELT performs the transformation step inside the destination after loading the data.
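The sketch below illustrates the ELT ordering: raw rows are landed in the destination with no pre-processing, and the transformation is then expressed in SQL and run by the destination's own engine. SQLite stands in for a cloud warehouse here, and the table and column names are illustrative.

```python
# Minimal ELT sketch: load raw data first, transform it in the destination.
import sqlite3


def elt(raw_orders: list[dict], db_path: str = "warehouse.db") -> None:
    """raw_orders: dicts with keys order_id, customer_id, and amount (hypothetical shape)."""
    conn = sqlite3.connect(db_path)
    # Load: land the raw data with no pre-processing.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, customer_id TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_orders (order_id, customer_id, amount) "
        "VALUES (:order_id, :customer_id, :amount)",
        raw_orders,
    )
    # Transform: express the business logic in SQL and run it where the data now lives.
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        "CREATE TABLE customer_revenue AS "
        "SELECT customer_id, SUM(amount) AS total_amount "
        "FROM raw_orders GROUP BY customer_id"
    )
    conn.commit()
    conn.close()
```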

EtLT

EtLT is a sub-variation of, or elaboration on, ELT in which data undergoes some basic transformations between Extract and Load. However, major transformations, to support business logic or data modeling, are still reserved for the final transform stage.

Common ETL tools

Many ETL tools have been developed over the years, each with its own strengths and perspective. ETL tool developers have included database vendors, cloud service providers, business intelligence / reporting vendors, and independent providers focusing exclusively on data integration.

Database vendors often provide ETL tools to enable better data integration within their product suite.

Database ETL Tools:

  • Oracle Data Integrator
  • Microsoft SQL Server Integration Services
  • DB2 Web Query DataMigrator ETL Extension
  • IBM Data Integration

Cloud service providers offer ETL capabilities within their platforms to facilitate loading data into their services and integrating data among them:

Cloud ETL Tools:

  • AWS Glue: Serverless data integration in AWS
  • AWS Data Pipeline: Managed ETL service for AWS
  • Azure Data Factory: Fully managed serverless data integration for Azure
  • Google Cloud Dataflow: Unified serverless batch and stream processing engine for GCP

Business Intelligence and Enterprise Reporting vendors usually provide ETL tools designed to provide data for business intelligence and reporting operations.

BI ETL Tools:

  • BIRT Analytics Loader
  • SAP Business Objects Data Integrator
  • IBM Cognos Data Manager

Data integration is an industry segment in its own right, and ETL is a large subset of it. Tools originally from vendors specializing in ETL include:

ETL Data Integration Tools:

  • Airbyte
  • Pentaho Data Integration
  • Talend Open Studio
  • Hevo Data
  • Informatica
  • Alteryx
  • Fivetran
  • Matillion

But in our experience, a large amount of ETL is performed without a tool, instead hand-written in custom code using some mix of SQL queries, database triggers, stored procedures, and scripting in languages such as Python, Perl, Java, server-side JavaScript, or Bash. As data volumes have expanded and data variety has increased, we've also seen custom ETL implemented with combinations of Apache Hadoop MapReduce or Apache Spark.

Challenges

The successful implementation of a reliable, robust, and secure multi-step process that accesses, transforms, and stores data across multiple data sources and sinks involves successfully navigating a number of challenges.

For instance:

  • Permissions: Do your networks and systems have access and rights to the data?

  • Data freshness: Are you capturing real-time data, or stale data that's no longer of value? What is the ephemeral nature of the data? Are you able to capture it before the data passes its lifetime?

  • Data quality and integrity: Do you have validation in place to notice if the data that is extracted is in an expected form? Combining data from multiple sources can be challenging due to differences in data formats, structures, and definitions. It can be difficult to identify and correct errors or inconsistencies in the data. Reasoning about the lineage of data (i.e. what the original source was) can be difficult once data sources are integrated.

  • Availability and scale: Is there enough storage and compute in your staging area to keep up with the data? (The more data that needs to be transformed, the more computationally and storage intensive it can become.)

  • Filtering: Which data is important data and which can be ignored or discarded?

  • Security and privacy: ETL processes often involve sensitive or confidential data. Organizations want to ensure that data is protected and accessed only by authorized users, with increased complexity when dealing with large volumes of data or data from multiple sources.

  • Data governance and compliance: Data governance policies and regulations stipulate how data provenance, privacy, and security are to be maintained, with additional complexity arising from the integration of complex data sets or data subject to multiple disparate regulations.

  • Data transformation and cleansing: ETL processes often require significant data transformation and cleansing in order to prepare data for analysis or integration with other systems.

  • Integration with other systems: ETL processes often involve integrating data with other systems, such as data warehouses, analytics platforms, or business intelligence tools. This can be challenging, particularly when dealing with complex data structures or integration requirements.

Apache Kafka: A Complete Streaming ETL Platform

Apache Kafka® is a distributed streaming platform that is the core of modern enterprise architectures. Kafka is a flexible and scalable foundation for building ETL pipelines that connect distributed systems and event-driven applications, allowing businesses to process data in real time and at scale:

  • Kafka connectors, which run within the Kafka Connect framework, enable developers to extract data from one system or load it into another

  • Kafka Streams API gives applications the stream processing capabilities to transform data, one message or event at a time. These transformations can include joining multiple data sources, filtering data, and aggregating data over a period of time

  • Confluent Schema Registry provides data governance with Avro schemas

  • Confluent Control Center provides central data management

  • ksqlDB is a database purpose-built for stream processing applications, providing a familiar, lightweight syntax to capture, process, and serve queries using only SQL.

  • Stream Designer provides a graphical canvas to build pipelines powered by Apache Kafka®, with SQL extensibility.

In contrast to batch ETL, Kafka enables streaming ETL pipelines, allowing real-time transformation and distribution of data at large scale.
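As a minimal illustration of streaming ETL on Kafka, the Python sketch below uses the confluent-kafka client to consume raw events from one topic, transform each event as it arrives, and produce the result to another topic. The broker address, topic names, and event fields are placeholders; in practice the extract and load sides are often handled by Kafka connectors, and the transformation by Kafka Streams or ksqlDB.

```python
# Streaming transform sketch with the Confluent Python client (confluent-kafka).
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "streaming-etl-demo",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["raw-orders"])            # placeholder source topic
try:
    while True:
        msg = consumer.poll(1.0)              # extract: read the next event
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        enriched = {                          # transform: one event at a time
            "order_id": order["order_id"],
            "amount_with_tax": round(order["amount"] * 1.08, 2),
        }
        producer.produce("orders-enriched",   # load: write to the sink topic
                         json.dumps(enriched).encode())
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()
```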