
Batch Processing vs Real-Time Data Streams

The world generates an unfathomable amount of data every minute of every day, and it continues to multiply at a staggering rate. Companies have quickly shifted from batch processing to data streams to keep up with the ever-growing amounts of big data. In this article, we’ll cover what data streaming is, how it works, and how your organization can benefit from real-time streams of data.

Data Streaming

What is data streaming?

Streaming data is a flow of data records generated by various data sources. It has many similarities to water flowing through a river or creek: the stream may be fed by rainfall, lakes, and creeks, smaller streams can combine into larger ones, and the pace of the flow varies in real-time. Similarly, data streams can be generated by all types of devices and activity, such as server log files, application metadata, social media activity, banking transactions, sensors, analytics engines, customer interactions, and push notifications. Each water drop, like each data record, is small, and the flow and volume vary from moment to moment. Data streams flow into data lakes, and applications can sample, analyze, and process the data as it flows in real-time, without needing to wait until the data collects in the lake.

Modern data processing has progressed from legacy batch processing toward working with real-time data streams. Similarly, consumers now stream data, such as movies on Netflix or songs on Spotify, instead of waiting for the entire movie or album to download. Data streams are a key part of the world of big data.

Read on to learn more about how data streaming helps with real-time analysis and data ingestion.

How data streaming works

Legacy infrastructure was much more structured because it only had a handful of sources that generated data, so the entire system could be architected to specify and unify the data and its structure.

Modern data is generated by an almost unlimited number of sources, whether hardware sensors, servers, mobile devices, applications, or web browsers, both internal and external, and it is almost impossible to regulate or enforce the structure of the data, or to control the volume and frequency at which it is generated. Applications that analyze and process data streams need to process one data packet at a time, in sequential order. Each data packet includes its source and timestamp so that applications can work with the stream.
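As a rough illustration, here is what such a packet might look like and how an application would consume packets one at a time. This is a minimal sketch in Python; the record shape and field names are hypothetical, not any particular platform's API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class StreamRecord:
    """A single data packet in a stream (hypothetical shape)."""
    source: str                                           # where the record was generated
    payload: dict                                         # the event data itself
    timestamp: float = field(default_factory=time.time)   # generation time

def process_stream(records):
    """Consume records one at a time, in the order they arrive."""
    for record in records:
        print(f"{record.timestamp:.3f} [{record.source}] {record.payload}")

process_stream([
    StreamRecord("web-browser", {"event": "click", "page": "/pricing"}),
    StreamRecord("iot-sensor", {"temperature_c": 21.4}),
])
```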

Applications working with data streams will always require two main functions: storage and processing. Storage must be able to record large streams of data in a way that is sequential and consistent. Processing must be able to interact with storage, consume, analyze and run computation on the data.
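Here is a toy sketch of those two functions, assuming an in-memory list as a stand-in for real stream storage: an append-only log that records data sequentially and consistently, and a consumer that reads from an offset and runs a computation over what it finds. Production systems implement this with distributed, durable logs rather than a Python list.

```python
import threading

class AppendOnlyLog:
    """Minimal sketch of stream storage: records are appended in
    sequence and read back in the same order, like a simplified log."""
    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record) -> int:
        with self._lock:
            self._records.append(record)
            return len(self._records) - 1   # sequential offset of the record

    def read_from(self, offset: int):
        with self._lock:
            return list(self._records[offset:])

log = AppendOnlyLog()
log.append({"user": "a", "amount": 12.50})
log.append({"user": "b", "amount": 3.99})

# Processing: consume from a known offset, analyze, run a computation.
offset = 0
batch = log.read_from(offset)
total = sum(r["amount"] for r in batch)
offset += len(batch)
print(f"processed {len(batch)} records, running total = {total}")
```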

This also brings up additional challenges and considerations when working with data streams. Many platforms and tools are now available to help companies build streaming data applications.

Challenges Building Data Streaming Applications

Scalability: When system failures happen, log data coming from each device can jump from a rate of kilobits per second to megabits per second, and aggregate to gigabits per second across devices. For example, 10,000 devices each emitting 100 kilobits per second of logs add up to roughly 1 gigabit per second arriving at the system.

Adding more capacity, resources, and servers as an application scales happens quickly, exponentially increasing the amount of raw data generated. Designing applications to scale is crucial when working with streaming data.

Ordering: Determining the sequence of data in a data stream is not trivial, yet ordering is very important in many applications. A chat or conversation wouldn’t make sense out of order.

When developers debug an issue by looking at an aggregated log view, it’s crucial that each line is in order. There are often discrepancies between the order in which data packets are generated and the order in which they reach their destination, as well as discrepancies between the timestamps and clocks of the devices generating the data. When analyzing data streams, applications must be aware of their assumptions about ACID transactions.
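One common way to tolerate modest out-of-orderness is to buffer incoming events briefly and release them in event-time order, similar in spirit to the watermarks used by stream processors. The sketch below is a simplified, hypothetical version of that idea, not a production implementation:

```python
import heapq

def reorder(events, max_out_of_orderness=2):
    """Emit (timestamp, data) events in event-time order, tolerating
    records that arrive up to `max_out_of_orderness` timestamps late
    (a simplified watermark). Records later than that would have to be
    dropped or handled separately in a real system."""
    buffer = []                    # min-heap ordered by event timestamp
    watermark = float("-inf")
    for ts, data in events:
        watermark = max(watermark, ts - max_out_of_orderness)
        heapq.heappush(buffer, (ts, data))
        # Everything at or below the watermark is safe to emit in order.
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    while buffer:                  # flush whatever remains at end of stream
        yield heapq.heappop(buffer)

# Arrival order differs from generation order (the timestamps):
arrived = [(1, "hello"), (3, "are"), (2, "how"), (4, "you")]
print(list(reorder(arrived)))
# [(1, 'hello'), (2, 'how'), (3, 'are'), (4, 'you')]
```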

Consistency and Durability: Data consistency and data access are hard problems in data stream processing. The data read at any given time may already have been modified and gone stale in another data center in another part of the world. Data durability is also a challenge when working with data streams in the cloud.

Fault Tolerance: This is always a consideration when working with data stream processing, or any distributed system. How do you prevent data loss, and how do you recover lost data across all your systems?

Benefits of Streaming Data

All industries that generate data continuously benefit from processing streaming data. Use cases typically start with internal IT systems monitoring and reporting, such as collecting the data streams generated by employees interacting with their web browsers and devices, along with the data generated by the company’s applications and servers. A company’s operations and products then benefit from stream processing of data from sensors, equipment, data centers, and many more sources.

Since a company’s customers and partners also consume and process streaming data, the ability to send, receive, and process streaming data becomes increasingly important. And as more companies rely on that data, the ability to process and analyze it, and to apply machine learning and artificial intelligence to it, is crucial.

The key differences in selecting how to house and process all of an organization’s data come down to the following considerations.

Batch Processing vs Real-Time Streaming - What's the Difference?

Batch processing is when processing and analysis happen on a set of data that has already been stored over a period of time. Examples include payroll and billing systems that are processed weekly or monthly.

Streaming data processing happens as the data flows through a system, resulting in analysis and reporting of events as they happen. An example would be fraud detection or intrusion detection. Streaming data processing means the data is analyzed, and actions are taken on it, within a short period of time, in near real-time as best the system can.

Real-time data processing guarantees that the data will be acted on within a fixed period of time, such as milliseconds. An example would be a real-time application that purchases a stock within 20 ms of receiving a desired price.
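A toy sketch of that kind of latency-bounded action, with hypothetical names and the 20 ms budget borrowed from the example above:

```python
import time

LATENCY_BUDGET_MS = 20    # hypothetical real-time guarantee
TARGET_PRICE = 100.00     # hypothetical desired price

def on_price_event(symbol: str, price: float, received_at: float):
    """Act on a price event only if we can still meet the latency budget."""
    if price <= TARGET_PRICE:
        elapsed_ms = (time.monotonic() - received_at) * 1000
        if elapsed_ms <= LATENCY_BUDGET_MS:
            print(f"BUY {symbol} at {price} ({elapsed_ms:.2f} ms after receipt)")
        else:
            print(f"missed budget: {elapsed_ms:.2f} ms > {LATENCY_BUDGET_MS} ms")

on_price_event("ACME", 99.75, received_at=time.monotonic())
```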

Here’s a breakdown of major differences between batch processing, real-time data processing, and streaming data:

| | Batch Data Processing | Real-Time Data Processing | Streaming Data |
| --- | --- | --- | --- |
| Hardware | The most storage and processing resources, to process large batches of data | Less storage, to process the current or recent set of data packets; fewer computational requirements | Less storage, to process current data packets; more processing resources required to “stay awake” in order to meet real-time processing guarantees |
| Performance | Latency can be minutes, hours, or days | Latency needs to be in seconds or milliseconds | Latency must be guaranteed in milliseconds |
| Data set | Large batches of data | The current data packet or a few of them | Continuous streams of data |
| Analysis | Complex computation and analysis over a larger time frame | Simple reporting or computation | Simple reporting or computation |

Many companies are finding that they need a hybrid approach to data processing, where some data is processed in real-time for immediate insights and also stored for the more complex analysis and reporting achieved through batch processing.
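A minimal sketch of that hybrid idea, assuming an in-memory list as a stand-in for cold storage: each event updates a live aggregate on the real-time path and is also retained raw for later batch analysis.

```python
from collections import defaultdict

cold_storage = []               # stand-in for a data lake or warehouse
counts = defaultdict(int)       # a tiny real-time aggregate

def handle_event(event: dict):
    # Real-time path: update a live metric the moment the event arrives.
    counts[event["type"]] += 1
    # Batch path: persist the raw event for complex analysis later.
    cold_storage.append(event)

for e in [{"type": "login"}, {"type": "purchase"}, {"type": "login"}]:
    handle_event(e)

print(dict(counts))         # live view: {'login': 2, 'purchase': 1}
print(len(cold_storage))    # raw history retained for batch processing: 3
```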

Organizations differ in their stance on trusting open source versus proprietary software and the communities behind them. Data lakes are popular because of the widespread adoption of Hadoop and the rise in unstructured data coming from the various systems used across a company and from real-time data streams. Another aspect to consider is how easily the system can be updated when data sources and structures change: updating a relational database or data warehouse is costly, whereas such changes are simple with a data lake.

Main Differences

Unlike a data lake, a database and a data warehouse can only store data that has been structured. A data lake, on the other hand, imposes no such restriction: it stores all types of data, whether structured, semi-structured, or unstructured.

Data Lake vs Data Warehouse

Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.


Data Warehouse vs Database

Data warehouses and databases are both relational data systems, but they were built to serve different purposes. A data warehouse is built to store large quantities of historical data and enable fast, complex queries across all of it, typically using Online Analytical Processing (OLAP).

What They Have in Common

Data systems have mostly focused on the passive storage of data. Phrases like “data warehouse” or “data lake” or even the ubiquitous “data store” all evoke places data goes to sit. But in the last few years a new style of system and architecture has emerged which is built not just around passive storage but around the flow of data in real-time streams.


Data Lake, Data Warehouse, Database: What's the difference?

Data lake:

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.

Data warehouse:

A data warehouse is a large store of data accumulated from a wide range of sources within a company and used to guide management decisions.

Database:

A database, also called an electronic database, is any collection of data or information that is specially organized for rapid search and retrieval by a computer. Databases are structured to facilitate the storage, retrieval, modification, and deletion of data in conjunction with various data-processing operations.

Data Lake vs Data Warehouse

Data lakes and data warehouses are both used in organizations to aggregate multiple sources of data, but they differ in their users and optimizations. Think of a data lake as the place where streams and rivers of data from various sources meet. All data is allowed, structured or unstructured, and no processing is done to the data until after it arrives in the lake. This makes data lakes highly attractive to data scientists and to applications leveraging the data for AI/ML, where new ways of using the data become possible. A data warehouse is a centralized place for structured data to be analyzed for specific purposes related to business insights. The requirements for reporting are known ahead of time, during the planning and design of the warehouse and its ETL process. It is best suited for data sources that can be extracted with a batch process and for reports that deliver high value to the business.

Another way to think about it is that data lakes are schema-less and flexible enough to store relational data from business applications as well as non-relational logs from servers and places like social media, whereas data warehouses rely on a schema and accept only relational data.
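This contrast is often described as “schema on read” (the lake) versus “schema on write” (the warehouse). A minimal sketch with hypothetical names, using in-memory lists as stand-ins for both systems:

```python
import json

# Data lake: schema-less ("schema on read") -- any record is accepted
# as-is and only interpreted when it is eventually queried.
lake = []
lake.append(json.dumps({"user": "a", "clicked": "/pricing"}))   # app event
lake.append(json.dumps({"line": "GET /health 200"}))            # raw server log

# Data warehouse: schema enforced up front ("schema on write").
WAREHOUSE_SCHEMA = {"user_id": str, "amount": float}

def warehouse_insert(row: dict) -> dict:
    """Reject any row that does not match the declared schema."""
    for column, column_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(row.get(column), column_type):
            raise ValueError(f"row does not match schema: {column}")
    return row

warehouse_insert({"user_id": "a", "amount": 12.5})     # accepted
# warehouse_insert({"line": "GET /health 200"})        # would raise ValueError
```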

Data Warehouse vs Database

Data warehouses and databases both store structured data, but they were built for different scales and numbers of sources. A database thrives in a monolithic environment where the data is generated by one application. A data warehouse is also relational, but it is built to support large volumes of data from across all departments of an organization. Both support powerful query languages and reporting capabilities, and both are used primarily by business members of an organization.

What They Have in Common

Typically an organization will require a data lake, a data warehouse, and one or more databases for different use cases. All three focus on centralizing data into one place so that different parts of the business can analyze it and uncover insights. Newer architectures extend the warehouse to include data lakes and support data science analysis, and there is a shift from extremely large passive lakes toward acting on real-time streams to support massive scale.