This blog post is part two of a three-part series on how businesses can drive value using real-time data streaming and processing technologies. Stay tuned for part three, coming soon!
Stream processing flips the data processing paradigm – from store then process (as in batch process) to process then store if needed.
In data streaming, analytical queries are predefined and computed, and data flows over the queries. This changed paradigm requires comprehending new concepts. In part two of this three-part series on the business value of real-time streaming, we’ll discuss the importance of data streaming platforms, stream processing engines, and the managed services that support them. By the end of this article, you’ll understand more about the factors to consider when selecting stream processing technology and services to fit your company’s needs, as well as some of the common challenges you might encounter.
The two major battle-tested stream processing engines with the largest mindshare include Apache Flink® and Kafka Streams.
Apache Flink is an open-source stream processing/programming framework developed in Java that enables stream processing at scale with parallel distributed processing and fault-tolerance. Unlike Apache Kafka®, Flink doesn’t store any data. Its processing logic is operating on data that lives somewhere else, such as a topic in a Kafka cluster. Flink provides a rich set of developer APIs to implement parallel, distributed computations like filtering, aggregation, enrichment, and analytics on streaming data over time windows. It has been successfully deployed by enterprises like Alibaba, Netflix, Stripe, Reddit, and Uber for mission-critical use cases like payments processing, real-time budget calculations and projections, spam detection, and more [source].
Kafka Streams is a native library integrated and deployed on Kafka for stream processing. It simplifies stream processing and provides data parallelism, scalability, and fault tolerance. It allows developers to specify processing logic reading data from Kafka, transforming input into new streams and writing back results to Kafka.
Kafka Streams provides two abstractions to work with events and streams—record-based and table-based. The former abstracts a record in a stream as an unbounded sequence of immutable events, while the latter is an accumulated representation of a collection of streams as the current state of the system. Kafka Streams has found widespread usage across various industries by companies like Pinterest, The New York Times, Trivago and many more.
While Flink supports a richer set of APIs, provides lower latency compared to Kafka Streams, and supports complex event processing (CEP) and machine learning (ML) APIs, it is practical to use Kafka Streams when requirements are tightly coupled to Kafka. With Kafka Streams, no extra infrastructure is needed as it runs on top of Kafka, while Flink needs its own cluster and infrastructure.
For most organizations streaming systems need a service-level agreement (SLA) of 24x7x365. Building a distributed streaming platform with this kind of an SLA is extremely challenging, where failure is not a question of ‘if’ but ‘when.’ Architects and engineers need to make critical design choices that impact long-term successful development and deployment of such systems. Not having a clear understanding of the best architectural patterns and critical capabilities can result in fragile and suboptimal systems.
Stream processing is littered with failed technologies. Because of the changed paradigm and principles with stream processing, the technology is complex and has a steep development, deployment, operationalization, and adoption curve.
Stream processing addresses challenges of building real-time analytics applications to react to events with minimal lag from event generation to insight. Loading data to a repository is delayed until results of analytical queries are ready to be served. The reason being, writing to a repository with high throughput data increases latency and delays time to insight.
Modern stream processing platforms have democratized stream processing, with high levels of maturity lowering the bar to entry as well as costs for developing and deploying solutions. Streaming is a fast-growing space with multiple competing vendors. The landscape can be confusing, with terminologies like ”data streaming platforms,” “streaming connectors,” “stream processing engines,” “streaming databases,” “real-time OLAP databases,” and “data lakehouses” that are all associated with streaming capabilities. It is imperative to understand the differences before choosing a solution based on use cases, latency, data volume, processing needs, and SLAs.
A data streaming platform is the critical piece for enabling stream processing and analysis. While Kafka decouples producers from consumers and buffers streaming data for downstream processing and analytics, data streaming platforms like Confluent, AWS MSK, and Azure Event Hub implement Kafka’s asynchronous API and protocol.
Stream processing engines have processors to transform events in the streaming platform. These span from pre-Hadoop to Hadoop and post-Hadoop eras and include solutions like Apache Samza, Apache Storm, Apache Heron, and Spark Structured Streaming. Cloud native solutions like Amazon Kinesis, Google Dataflow, and Azure Stream Analytics also fall into this category. However, within the landscape of solutions, Apache Flink has emerged as the de facto standard for stream processing due to its high performance, rich feature set, and robust developer community.
Streaming databases analyze data that is persisted with low latency and then processed, as compared to traditional databases. Meanwhile, streaming solutions in data lakehouses are more micro-batch oriented. Streaming databases provide high throughput data ingestion and querying over data with a larger window context of data as compared to stream processing frameworks. Some examples of streaming databases include ksqlDB, Materialize, and Rising Wave.
Real-time OLAP databases are databases that can serve high query/sec (QPS) for a smooth user experience. Streaming databases are now capable of both stream processing and serving end user queries with high QPS. Examples include Apache Pinot, Clickhouse, Rockset, and StarRocks.
Ideally, when choosing a data streaming platform for your business, you should look for one that supports low latency and low overhead for recovery with checkpointing and state management to minimize data loss. The platform should provide bindings to multiple languages, especially SQL, Python, and Java. It should support fault tolerance and be capable of handling event spikes—“Streaming Flash Floods”—and avoid single points of failure. The platform should also be capable of handling late-arriving and out-of-order data. The mind map below provides an overview of the critical capabilities enterprises should keep in mind when selecting streaming platforms.
The stream processing hierarchy of needs pyramid below also summarizes the criteria organizations should look for when considering streaming solutions.
Organizations jump on the cloud bandwagon thinking it to be an escape valve from challenges of on-premises data & infrastructure management. Inevitably they run into challenges with stitching together cloud services to build a functioning end to end system. Cloud services for streaming have different dynamics, constraints, and patterns. They often require architectural and system-thinking expertise to scale autonomously, be multitenant, and control costs with unpredictable workload characteristics.
Managed services for stream processing do the heavy lifting to reduce time to insights (TOI) and time to market (TTM) with fine grained configuration and tuning that optimizes performance, throughput, and latency. They abstract away complexities of streaming infrastructures so data engineers and developers can focus on building applications.
Enterprises can choose to deploy Kafka as self-managed or use managed solutions. The choice depends on deployment complexities, SLAs, data residency requirements, and regulations governing the enterprise. Self-managed solutions provide more control with deployments, security, and governance policies, but require in-house skills and bandwidth to setup, manage, troubleshoot, and optimally configure infrastructure and storage for servers running Kafka with monitoring, DR (disaster recovery), HA (high availability), and replication.
Managed deployments can be expensive but they also save time, cost, and skilled resources in the long run. They do not fully alleviate the need for in-house Kafka skills, but typically include automated provisioning of infrastructure (servers, network, storage) and services around updates, monitoring, alerting, HA, replication, etc. To build reliable, fault tolerant, and highly available streaming applications, it is critical to leverage the enterprise-ready capabilities of a managed platform.
Confluent recently introduced Kora Engine, the engine that powers Confluent’s managed service, Confluent Cloud, to be elastic, resilient, performant, and cost effective. Kora is 100% compatible with Kafka's protocol, but powers Confluent Cloud to be a cloud-native data streaming platform that abstracts all the infrastructure and operations behind managing and scaling Kafka deployments. Kora automatically manages data locality between memory and object storage to maximize throughput while minimizing cost and latency with data recovery.
Apache Flink needs dedicated developers and operations engineers for resource management during peak capacity to avoid latency spikes with bursty traffic. It also needs instance sizing, state management setup, containerization, elasticity, configuration, tuning, monitoring, and integration with Kafka.
Through the recent acquisition of Immerok, Confluent is bringing to market its Apache Flink-powered stream processing solution. Confluent has introduced the open preview of its fully managed Flink service on Confluent Cloud. With this offering, Confluent is strongly positioning itself as the go-to solution for stream processing and analytics.
Managed services, like Confluent, support command line interface (CLI) and RESTful APIs and UIs to build and manage streaming solutions. They provide serverless deployment options allowing self-service provisioning to elastically scale clusters based on workload. This frees organizations from the burden of configuring hardware and the associated operational complexities. Additionally, with managed services, organizations can reduce downtime, minimize data loss, and improve mean time to recovery (MTTR) with high-availability clusters and horizontal scalability options.
Managed services also lower the bar to entry for organizations looking to build streaming applications. However, they have guardrails and restrictions around flexibility of managing infrastructure and configuration as compared to non-managed services. Streaming data architects and engineers need to carefully evaluate the choice of using managed services vis-à-vis a DIY (do-it-yourself) approach to streaming solutions.
In the rapidly evolving data landscape, data streaming addresses the need for real-time data integration, processing, and analytics, enabling organizations to harness the power of data in motion. However, leveraging streaming requires a deep understanding of underlying patterns, principles, and pitfalls. Selecting the right data streaming platform and strategies to overcome these challenges will eventually impact the ROI an enterprise derives from streaming solutions.
For more information on what to look for when choosing a data streaming platform and whether a managed service is right for you, check out Confluent’s Build vs Buy guide.
You can also head over to Confluent’s data streaming resources hub for the latest explainer videos, case studies, and industry reports on data streaming.
Interested in learning more about the new data streaming era and how Flink fits in? Check out this Forbes article featuring Confluent CEO, Jay Kreps.