A data pipeline moves raw data from various sources into a data store for further analysis. Modern data pipeline systems automate the ETL (extract, transform, load) process through data ingestion, processing, filtering, transformation, and movement across any cloud architecture, and they add layers of resiliency against failure.
With data coming from numerous sources, in varying formats, across different cloud infrastructures, most organizations deal with massive amounts of data and with data silos. Without a complete, unified view of your data, you won't be able to uncover deep insights, improve efficiency, or make informed decisions.
This is why data pipelines are critical. They are the first step to centralizing data for reliable business intelligence, operational insights, and analytics.
To understand how a data pipeline works, think of any pipeline that receives something from a source and carries it to a destination. The process of transporting data from assorted sources to a storage medium where an organization can access, use, and analyze it is known as data ingestion.
Along the way, the data undergoes different processes depending on the business use case and the destination itself. A data pipeline may be a simple process of data extraction and loading, or it may be designed to handle data in a more advanced manner, such as feeding a data warehouse used for predictive analytics or machine learning.
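As an illustration of the simplest case described above, the following is a minimal sketch of an extract-and-load pipeline using only the Python standard library. The CSV text, the `orders` table, and the function names `extract`, `load`, and `run_pipeline` are assumptions made up for this example, and an in-memory SQLite database stands in for the destination data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: CSV text standing in for a file or API export.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,5.00
3,carol,42.50
"""


def extract(raw_text):
    """Yield one dict per CSV row from the raw source text."""
    yield from csv.DictReader(io.StringIO(raw_text))


def load(rows, conn):
    """Write the extracted rows into the destination table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    records = [
        (int(r["order_id"]), r["customer"], float(r["amount"])) for r in rows
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


def run_pipeline(raw_text=RAW_CSV):
    """Ingest the raw text into an in-memory SQLite store and return the connection."""
    conn = sqlite3.connect(":memory:")
    load(extract(raw_text), conn)
    return conn
```

Calling `run_pipeline()` returns a connection whose `orders` table can then be queried for analysis. A production pipeline would typically replace both ends with real connectors and add retry and error handling.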
As data moves through a pipeline, four processes occur: extract, govern, transform, and share (data virtualization).
Each data pipeline begins with a dataset, or a collection of raw datasets, extracted from any number of sources. The data comes in wide-ranging formats, from database tables and file names to topics (Kafka), queues (JMS), and file paths (HDFS). There is no structure or classification of the data at this stage; it is a data dump, and no sense can be made of it in this raw form.
Once the data is ready to be used, it needs to be organized at scale, a discipline called data governance. Raw data becomes meaningful when it is linked to its business context. Enterprises can then take control of their data quality and security and fully organize the data for mass consumption.
The process of data transformation cleanses and changes the datasets to bring them into the correct reporting formats. This includes eliminating unnecessary or invalid data and enriching the data in accordance with the rules and regulations determined by the business's needs.
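One way to picture linking raw data to its business context is a small catalog that maps raw field names to a business name, an owner, and a sensitivity flag. The sketch below is illustrative only: `FieldPolicy`, `CATALOG`, `govern`, and the field names are hypothetical and do not come from any particular governance product.

```python
from dataclasses import dataclass


@dataclass
class FieldPolicy:
    """Business context for one raw field (all names here are hypothetical)."""
    business_name: str
    owner: str
    sensitive: bool = False


# Hypothetical catalog mapping raw column names to business meaning.
CATALOG = {
    "cust_eml": FieldPolicy("Customer email", owner="crm-team", sensitive=True),
    "ord_amt": FieldPolicy("Order amount (USD)", owner="finance-team"),
}


def govern(record, catalog=CATALOG):
    """Return (governed, unclassified).

    governed: fields renamed to their business names, with sensitive values masked.
    unclassified: raw field names that have no catalog entry yet.
    """
    governed, unclassified = {}, []
    for key, value in record.items():
        policy = catalog.get(key)
        if policy is None:
            unclassified.append(key)
            continue
        governed[policy.business_name] = "***" if policy.sensitive else value
    return governed, unclassified
```

Fields without a catalog entry are reported rather than silently passed through, which keeps unclassified data from reaching consumers before someone has assigned it meaning and ownership.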
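The cleansing and enrichment described above can be sketched in a few lines of Python. The validity rule (a non-empty `id` and a non-negative numeric `amount`), the `REGION_BY_COUNTRY` lookup, and the record layout are assumptions chosen for illustration; real rules would come from the business requirements.

```python
# Hypothetical enrichment lookup: country code -> sales region.
REGION_BY_COUNTRY = {"US": "AMER", "DE": "EMEA", "JP": "APAC"}


def is_valid(record):
    """Keep a record only if it has an id and a non-negative numeric amount."""
    try:
        return bool(record.get("id")) and float(record["amount"]) >= 0
    except (KeyError, TypeError, ValueError):
        return False


def transform(records):
    """Drop invalid records, normalize fields, and enrich each row with a region."""
    cleaned = []
    for record in records:
        if not is_valid(record):
            continue  # eliminate unnecessary or invalid data
        country = str(record.get("country", "")).strip().upper()
        cleaned.append({
            "id": record["id"],
            "amount": round(float(record["amount"]), 2),
            "country": country,
            # Enrichment: add a field derived from reference data.
            "region": REGION_BY_COUNTRY.get(country, "UNKNOWN"),
        })
    return cleaned
```

Records that fail validation are skipped here for brevity; a pipeline that needs auditability might route them to a separate location for inspection instead.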
After the data is transformed, trusted data is finally ready to be shared. It is often output into a cloud data warehouse or endpoint application for easy access by multiple parties.
Used by Walmart, Expedia, and Bank of America, Confluent is the only complete data streaming platform designed to stream data from any source, at any scale. Built by the original creators of Apache Kafka, its streaming technology is used by 80% of the Fortune 100. Beyond real-time data ingestion, Confluent enables large-scale streaming data pipelines that automate real-time data flow across any system, application, or data store with 120+ pre-built connectors.