There was a time not so long ago when “Big Data” was the hottest thing since sliced bread. Developers talked about it. Managers wanted it. Presenters made memes about it. Children asked their parents for Hadoop clusters for Christmas (okay, maybe not quite that far).
Big Data was going to change the way everything worked. We were about to solve every financial, medical, scientific, and social problem known to humankind. All it would take was a great big pile of data and some way to process it all.
But somewhere along the line, the big data revolution just sort of petered out, and today you barely hear anything about big data. Today, even saying “big data” is like exhibiting the millennial pause – you date yourself as soon as you open your mouth.
So what happened? Where did it go?
Big data very much followed the typical hype cycle. But given the pronouncements of curing cancer and solving world hunger via improved food distribution, big data’s hype peak wasn’t confined to the ill-conceived open offices of countless developers. Rather, it was plastered all over the internet, newspapers, and the blogosphere (remember that?). Products whose only commonality with Hadoop was that they required more than one server were suddenly “big data systems,” perfect for serving the entire needs of the internet at web scale.
The ride down into the trough of disillusionment resulted in a healthy industry-wide change. Shouting “big data!” over and over again was no longer enough to get attention (and funding). You actually needed to prove that your company or technology could do something, and that something needed to be demonstrably valuable. For established giants such as Amazon, it was about figuring out what new products to develop based on the learnings from Hadoop.
While big data didn’t die per se, it did go through a metamorphosis, going from new and challenging to normal and boring. In fact, many of the common products and tools that we use today have their roots firmly in big data.
The Hadoop Distributed File System (HDFS) was the unruly pioneer, rough around the edges, but truly a world first as a readily available, massively distributed file system. S3 is the ubiquitous, more boring, mature, and easier-to-use descendant. Gone is the operational burden, and in its place is effectively a magic bag of holding for all of your file storage needs.
Amazon’s S3 replicated and improved upon the distributed filesystem model of HDFS while preserving the original interface. Gradually and incrementally, S3 came to replace HDFS as the file system of choice for many big-data use cases, eventually taking over even non-big-data use cases like file backups and asset serving. The big data filesystem slipped into the fabric of everyday storage primitives, abstracting away the often finicky inner workings of the cloud file system.
MapReduce, rooted in functional programming, inverted the execution model: instead of bringing the data to the code, the code was brought to the data. Executing on the distributed nodes of the Hadoop cluster, MapReduce made it possible to wrangle thousands of terabytes of data in parallel. But it was slow, relying extensively on writing to spinning disks. Apache Spark™ jumped in to take its place, promising (and delivering) 100x speed improvements on common jobs, and eventually moved away from the low-level RDD model to the higher-level models of DataFrames and, finally, SQL. The same went for Apache Pig (remember that hog?): what was once a specialist language for deriving big data results was boiled back down into simple SQL queries.
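If you never wrote a MapReduce job yourself, here is a rough, single-process sketch of the idea in plain Python. It is not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and each inner list stands in for a block of data living on a separate node.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: fold the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(partitions):
    """Run map, shuffle, and reduce over a list of partitions.

    In a real cluster each partition would be mapped on the node that
    stores it; here everything runs in one process (an assumption made
    purely to keep the sketch runnable).
    """
    mapped = [
        pair
        for partition in partitions
        for line in partition
        for pair in map_phase(line)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count([["big data big"], ["data pipelines"]]))
```

In SQL terms, the whole thing collapses to something like a `COUNT(*)` with a `GROUP BY word`, which is pretty much where the industry ended up.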
Another hallmark of big data, and with it, Hadoop, was its resiliency to failures. Any data processing jobs you executed were meant to keep processing even in the case of one or more failures. A failed power supply, a seized hard drive, or even bits clogged up in the network pipe would be recoverable, with the bad or unresponsive node(s) removed from the system and work redistributed back to the remaining healthy processing nodes. Hadoop was designed to handle failures as part of regular operation and not as an exceptional case, a design philosophy that has since carried over into the massive cloud-scale services and architectures we see today. When was the last time S3 told you that a node serving your data had crashed?
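To make the "failure is normal" idea a little more concrete, here is a toy scheduler in Python. It is only a sketch under simplifying assumptions, not how Hadoop actually schedules work: `NodeFailure`, `run_job`, and the `execute` callback are hypothetical names, and nodes are just strings. When a node fails, it is dropped from the healthy pool and its task goes back on the queue for someone else to pick up.

```python
class NodeFailure(Exception):
    """Raised by the (hypothetical) execute callback when a node is down."""


def run_job(tasks, nodes, execute):
    """Run every task, routing around nodes that fail.

    tasks:   list of hashable task identifiers
    nodes:   list of node names
    execute: callable(node, task) -> result, may raise NodeFailure
    Returns (results, healthy_nodes).
    """
    healthy = list(nodes)
    pending = list(tasks)
    results = {}
    turn = 0  # round-robin counter over the healthy nodes

    while pending:
        if not healthy:
            raise RuntimeError("no healthy nodes left to run the job")
        task = pending.pop(0)
        node = healthy[turn % len(healthy)]
        try:
            results[task] = execute(node, task)
            turn += 1
        except NodeFailure:
            # Treat failure as routine: drop the node, requeue the task.
            healthy.remove(node)
            pending.append(task)

    return results, healthy


if __name__ == "__main__":
    def flaky_execute(node, task):
        if node == "b":  # pretend node "b" has a seized hard drive
            raise NodeFailure(node)
        return task * 2

    print(run_job([1, 2, 3, 4], ["a", "b", "c"], flaky_execute))
```

The caller never sees the failure. It just gets its results back, which is the same experience S3 gives you today.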
While big data itself lingers in our collective terminology, its greater impact lies in how it redefined what it means to operate at scale. The influence of HDFS continues to underpin our entire cloud service experience. Invisible resiliency to failures, and recovery from them, is not only an expectation but a requirement for any large-scale cloud service. In fact, our very own Confluent Cloud relies heavily on cloud storage to serve as our infinite retention backing layer.
And finally, big data has shown us that no matter how hard we try, there’s simply no escaping from the inevitable convergence to a full SQL API.