Show Me How: Build Streaming Data Pipelines for Real-Time Data Warehousing | Register Today
On a recent episode of Streaming Audio, Gwen Shapira, Michael Noll, and Ben Stopford joined me to hold forth about the near future of Apache Kafka® and software architecture in general. House rules were that predictions could cover any topic, but they had to be “precise” in the spirit of Bob Metcalfe, who, back in the 90s, famously predicted a particular day the dot-com bubble would pop, under the theory that it was better to be wrong than vague. At least we all felt pretty sure that our predictions couldn’t possibly be worse than the ones being made at the same time in 2019.
We began, fairly enough, with Kafka itself. Gwen started us off by predicting that by the end of the year it will be possible to run a Kafka cluster with 10 million partitions, facilitated by some consequential architectural changes: KIP-405 (Tiered Storage) and KIP-500 (ZooKeeper removal). These KIPs enable growth in the number of partitions by moving data out of the cluster proper and enabling metadata to be managed in a more scalable and robust way.
Ben predicted being able to double the size of a Kafka cluster in seconds, a task specifically enabled by Tiered Storage. Tiered Storage is often understood simply as a cost-savings and storage play, which is fair enough. People using Kafka as a system of record tend to want longer retention periods, and Tiered Storage is an obvious enough improvement to the economics of that architecture, but it doesn’t stop there. You also get quick autoscaling. Because so much state gets offloaded to your friendly neighborhood cloud object store, when you go to scale brokers, there is significantly less data to move around.
Another architectural benefit is a potential performance boost: data in the object store tier is accessed over the network, with the presumption that it is accessed less frequently than data still on disk. If one plays one’s cards right, that local hotset can fit entirely into the broker’s page cache, making I/O on the hotset a vastly faster proposition. Remember, when it comes to data access patterns, the power law works for you; you don’t work for it.
Michael’s prediction, for which there was consensus (see what I did there, KIP-595?), was the continued growth of Kafka-like streaming features in products across the data landscape: from relational stalwarts like Oracle, to Redis, to traditional messaging systems like RabbitMQ. Users have increasingly come to expect features that will let them work with real-time, unbounded datasets, and vendors tend to notice things that users expect.
Given that this transition to understand systems “events first” is well underway and already looks rooted in the emerging software architecture consensus, I will see Michael’s prediction for the year and raise him another couple of decades: I predict event streaming will be seen as the dominant paradigm of this generation’s software architectures.
So it’s clear that event streaming is happening, but another question is how it can best be added to existing database products, since many existing tools came to life before this paradigm was yet a thing. To begin with, companies each have their own idea for how streaming should even be defined, as Michael has seen firsthand with his work on the committee writing the SQL standard’s streaming extension. And it can be hard to retrofit an existing product built under batch- or state-oriented assumptions, particularly when one wants to operate it at scale. As Gwen pointed out, it may even require completely new data structures to make a truly successful multi-paradigm solution.
As a side note, the broader Kafka ecosystem is making its own claim on multi-paradigm status, since it began with streaming and later added ksqlDB, which brings database concepts and SQL itself into an event-driven system.
So our money is on the table. I must say that I would be surprised if when I’m starting to roll into my Christmas playlist in October of 2021, streaming isn’t even more on the minds of those in the industry than it is now. I don’t know that it will be completely mainstream, but if you’re not doing it already by then, or at least thinking seriously about it, you might start to feel a bit behind the zeitgeist. It would also be surprising if by that same time the effects of KIP-500 and the rest of the gang haven’t started to make their mark in the community’s collective imagination, as we continue to think about what we might build with Kafka next.
If you want to hear the episode for yourself, have a listen to Streaming Audio and make sure to subscribe through Apple Podcasts or wherever fine podcasts are sold.
Apache Kafka and stream processing solutions are a perfect match for data-hungry models. Our community’s solutions can form a critical part of a machine learning platform, enabling machine learning engineers to deliver real-time MLOps strategies.
Learn how modern data management approaches like data mesh and event-driven architecture (EDA) can be used to manage data platforms and how to take advantage of them.