Over the last decade, there’s been a massive movement toward digitization. Enterprises are redefining their business models, products, and services so they can innovate, thrive, and compete by quickly discovering, understanding, and applying their data assets to power real-time use cases. Under the hood, data pipelines do much of the heavy lifting, coordinating data movement and extracting, transforming, and loading data to serve various operational and analytical use cases.
A data pipeline is a process for moving data from one location (such as a database) to another (such as another database or a data warehouse). Data is transformed and modified along the journey, eventually reaching a stage where it can be used to generate business insights. But of course, in real life, data pipelines get complicated fast, much like an actual physical pipeline that must traverse great distances with twists, turns, and obstacles.
There are different types of data pipelines, generally falling into these three categories:
Extract, transform, load (ETL)
Extract, load, transform (ELT)
Reverse ETL processes
These three processes specify different means of moving data from one place to another while possibly transforming data into a more suitable context for its destination.
Data pipelines can become quite sophisticated and complex, particularly when they’re used in the domains of predictive analytics and machine learning. They can serve data to databases and other systems for operational use cases; data warehouses or data lakes for analytical use cases; and BI platforms for business intelligence use cases. And your organization might have data residing in various locations and formats, including the cloud or multiple clouds, on-premises infrastructure, and serverless options—all of which increase overall complexity when it comes to operating these data workflows while keeping them secure and governed.
A data pipeline is a sequence of data-processing elements where the output of one is the input of the next. It’s a critical primitive in most, if not all, types of IT architecture where data is collected, transformed, and routed in order to create and take advantage of analytics. Data pipelines have served the IT industry for many years and thus tend to be part of a more legacy data infrastructure, though recent developments are helping make pipelines more flexible.
As mentioned, batch-based data pipelines fall into three broad categories: ETL, ELT, and reverse ETL. In the ETL process, data is first extracted from one or more original sources. It’s then transformed to ensure high quality, with transformation steps that include deduplication, standardization, verification, sorting, and cleansing. Finally, the data is loaded into the target destination, such as a cloud data warehouse like Snowflake or an operational database like MySQL. With the rise in popularity and affordability of cloud data warehouses, variations on the ETL process have emerged: extract, load, and transform (ELT), which loads raw data into the warehouse first and transforms it there, and reverse ETL, which moves analyzed data from centralized data warehouses back to systems and applications (such as databases, SaaS apps, etc.) to serve operational and BI use cases.
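To make the distinction concrete, here’s a minimal sketch of how the three variants order the same basic steps. All function names and the in-memory “warehouse” lists are hypothetical, invented purely for illustration:

```python
# Hypothetical extract/transform/load steps; real pipelines would talk to
# actual source systems and warehouses.

def extract():
    # Pull raw records from a source system (amounts arrive as strings).
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "7.25"}]

def transform(records):
    # Cleansing step: convert string amounts into numeric values.
    return [{**r, "amount": float(r["amount"])} for r in records]

def load(records, store):
    # Stand-in for writing to a warehouse or operational database.
    store.extend(records)
    return store

warehouse = []

# ETL: transform in flight, then load the cleaned result.
load(transform(extract()), warehouse)

# ELT: load the raw data first, then transform inside the warehouse.
staging = []
load(extract(), staging)
warehouse_elt = transform(staging)

# Reverse ETL: push curated warehouse data back out to an operational app.
operational_app = []
load(warehouse, operational_app)
```

The end state is the same cleaned data; what differs is where the transformation runs (in the pipeline for ETL, in the warehouse for ELT) and which direction the data flows (warehouse back to applications for reverse ETL).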
To more specifically describe how the ETL process works, it’s helpful to break it down into the three processes that data encounters as it moves through the pipeline.
Collect — Data is pulled as raw datasets from one or more sources and can take on a variety of formats, such as flat files or tables
Transform — The collected data is then cleansed and transformed into a format compatible with the downstream system it’s being moved to
Share — The transformed data is output into a data warehouse or operational systems where it can be consumed by others
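The three stages can be sketched end to end in a few lines. This is a toy illustration under assumed inputs: the sample source records, field names, and destination list are all made up, and a real transform stage would be far richer.

```python
def collect(*sources):
    # Collect: pull raw datasets from one or more sources into one list.
    return [record for source in sources for record in source]

def transform(records):
    # Transform: deduplicate by id, standardize fields, drop bad rows.
    cleaned = {}
    for r in records:
        if r.get("email"):                          # cleansing: skip incomplete rows
            cleaned[r["id"]] = {
                "id": r["id"],
                "name": r["name"].strip().title(),  # standardization
                "email": r["email"].lower(),
            }
    return sorted(cleaned.values(), key=lambda r: r["id"])  # sorting

def share(records, destination):
    # Share: output the transformed data to a downstream consumer.
    destination.extend(records)
    return destination

# Two hypothetical sources with overlapping and incomplete records.
crm = [{"id": 2, "name": " ada lovelace ", "email": "ADA@example.com"}]
billing = [{"id": 1, "name": "alan turing", "email": "alan@example.com"},
           {"id": 2, "name": "Ada Lovelace", "email": "ada@example.com"},
           {"id": 3, "name": "no email", "email": ""}]

warehouse = share(transform(collect(crm, billing)), [])
```

After the run, the duplicate customer appears once, names and emails are normalized, and the row with no email has been cleansed away.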
This data pipeline process ultimately renders data available for reliable business use. There are various ways that data pipelines can be architected, but for a continuous data flow that enables different teams to access data in the right format, at the right time, a streaming data pipeline is superior to a batch-based pipeline.
While pipelines have been around for a few decades in the computing world, they’ve run into challenges with the advent of the cloud and the growing volumes of data that organizations must manage. Beyond still relying on a legacy batch-based approach to processing, pipelines often exist in silos, and teams often build pipelines for only one purpose, so they aren’t reusable.
Overall, we’ve found five common attributes of traditional data pipelines that pose challenges:
Batch: Pipelines ingest and transform data in periodic batches, leading to information that is stale on arrival and to cascading delays driven by periodic scheduling.
Centralized: Pipelines are governed by central teams that often end up becoming a bottleneck since they don’t understand the domain data needed to deliver insights. Also, there is no clear data lineage or ownership, which slows down self-service access and thus innovation.
Ungoverned: Pipelines aren’t consistently governed or observed, and a patchwork of connections leads to an inability to scale and added risk for security and compliance requirements.
Infrastructure-reliant: Traditional ETL pipelines demand intensive compute and storage; as data volumes grow, processing becomes slow and costly.
Monolithic design: Pipelines have typically been rigid and difficult to adapt to new business logic or data, leading to new pipeline creation, which causes increasing pipeline sprawl and technical debt.
Streaming data pipelines provide real-time data flows across an organization, sending data from source to target while being able to continuously enrich and transform data along the way. They can enable the best of two worlds: the ability to tap into real-time streaming use cases while still capitalizing on historical data stored in data-at-rest systems like databases and other operational systems. This is the architecture underlying, for instance, online grocery ordering processes, which must tap into both real-time inventory and a customer’s historic purchasing behavior.
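To illustrate that “best of two worlds” pattern, here is a toy sketch of a pipeline that enriches each event the moment it arrives, joining a live order stream against historical purchase data at rest. In production this role is played by streaming platforms such as Kafka; here plain Python generators stand in, and every name and value is hypothetical:

```python
# Historical purchase counts: the "data at rest" side of the join.
purchase_history = {"alice": 12, "bob": 1}

def order_stream():
    # Stand-in for an unbounded stream of incoming order events.
    yield {"customer": "alice", "item": "milk"}
    yield {"customer": "bob", "item": "eggs"}

def enrich(events):
    # Transform each event as it arrives, rather than in a later batch:
    # look up the customer's history and attach a derived field.
    for event in events:
        history = purchase_history.get(event["customer"], 0)
        yield {**event, "repeat_customer": history > 5}

enriched = list(enrich(order_stream()))
```

The key property is that `enrich` never waits for a batch window: each order is combined with historical context and passed downstream immediately, which is exactly what an online grocery flow checking live inventory against past purchasing behavior requires.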
Many sophisticated enterprises are moving toward using streaming data pipelines to reinvent data flows across their organization with a decoupled architecture and different teams able to access and share data to quickly build data products for a variety of use cases. Often, that data is routed across cloud architectures, but sometimes it’s still located on-premises. There are various viable cloud data warehouses in the mix, including Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Azure SQL Data Warehouse.
Streaming data pipelines are also an essential part of offering data as a self-serve product and building a broader data mesh in an organization. Teams can innovate faster, be more productive, and speed up real-time initiatives with streaming vs. older batch pipelines.
At Confluent, our approach to building better pipelines—and solving the challenges of legacy pipelines—rests on five foundational principles:
Streaming pipelines allow a business to maintain real-time repositories of reusable data for immediacy and to facilitate use by many downstream consumers.
Decentralized pipelines let the teams closest to the data create shareable data streams for easy use and reuse.
Declarative pipelines, based on languages like SQL, create representations of a modern data flow journey and eliminate ops details.
Developer-oriented pipelines allow for independent development, versioning, and testing for improved flexibility.
Governed pipelines offer automated capabilities like observability, search, and lineage so teams can safely find, use, and trust data streams.
These capabilities can set you and your teams up to work together more productively and cut out a lot of wasted time and resources. Confluent enables IT teams to start with one streaming data pipeline use case (for example: building streaming pipelines between operational systems or to cloud databases) and expand to other streaming pipeline use cases over time (for example, building streaming pipelines to cloud data warehouses for real-time analytics).
Confluent delivers a modern approach to breaking down data silos with streaming data pipelines. You can connect data by easily creating and managing data flows with an easy-to-use UI and pre-built connectors; govern data with central management, tagging, policy application, and more; enrich data using SQL to combine, aggregate, clean, process, and shape it in real time; build trustworthy data products for use by downstream systems and apps; and share data securely in live streams with self-service data discovery and sharing.