Many leading lights of the Apache Kafka® community have appeared as guests on Streaming Audio at one time or another in the past three years. But some of its episodes are particularly memorable—and each for its own reasons. Read on for the 10 episodes that I consider particularly worth listening (or relistening) to, along with a key fact from each.
Michael Noll’s four-part blog series Streams and Tables in Apache Kafka is one of the most compact-yet-comprehensive introductions to Kafka and its ecosystem in the literature, covering everything from the definition of an event through to stream processing with ksqlDB and Kafka Streams. In this episode of Streaming Audio, Michael distills his blog series to its essence, presenting its content in a format ideal for those who prefer to learn by listening. Even seasoned Kafka users are likely to learn something from his explanations.
It’s true that the most requested feature for KafkaJS is stream processing à la Kafka Streams. This is an enormous effort, so let’s not pile on Tommy to get it done all at once.
Event-driven architectures have existed for decades, but only in the last couple of years have they begun to become the default for new designs. Simon Aubury of Thoughtworks has been building event-driven systems for large, complex organizations for five years. In this episode of Streaming Audio, he shares the advice he would give Younger Simon about this work if he had the opportunity. Learn why “passive-aggressive events” aren’t good building blocks for systems, what the distinction is between orchestrating and choreographing, and strategies for optimizing event sizes, timestamps, and organizational buy-in.
In a rare episode of Streaming Audio that features a content warning, learn how Patrick Neff of Baader applies machine learning and other data analytics methods, along with Kafka Streams and ksqlDB, to solve a specific industrial problem: the best way to handle chickens on the journey from farm through processing. There are an enormous number of inputs to be tracked during the entire process, and in the quest for high meat quality and humane chicken treatment, Patrick and his team stitch together raw data, ask questions, and compute answers they push back into their Kafka-based system.
Data can be even more valuable when it is used to automate other processes rather than being sent to a dashboard for human consumption. Save the human consumption for the chicken!
This episode of Streaming Audio provides a deep dive into the topic of change data capture (CDC), a key technology for uniting your new data-in-motion design with its neighboring data-at-rest systems (i.e., databases). We talk about the flagship open source tool, Debezium, and its advantages over query-based solutions. Guest Abhishek Gupta of Microsoft has spent most of his career focused on helping developers, and is well versed in the Kafka ecosystem and in how it can be enhanced by Azure’s tooling, such as Azure Data Explorer. Remember: the database should call you; you shouldn’t call it. If that doesn’t make sense, you should give this episode a listen.
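To make "the database should call you" a little more concrete, here is a minimal sketch of what a consumer of change events might do with them. The event shape is loosely modeled on the `op`/`before`/`after` fields of a Debezium change event, but it is heavily simplified; the `id` primary key, the `apply_change` helper, and the sample rows are all hypothetical.

```python
def apply_change(table, event):
    """Apply one simplified change event to an in-memory copy of a table.

    Assumes each row is a dict with a hypothetical primary key field "id".
    """
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":  # delete
        table.pop(event["before"]["id"], None)
    return table


# The database pushes changes to us; we never poll it with queries.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Grace"}},
    {"op": "d", "before": {"id": 2, "name": "Grace"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

Note that the delete of row 2 arrives as an event of its own. A query-based approach that polls the table periodically would simply stop seeing the row, which is one of the reasons log-based capture is attractive.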
This episode features RocksDB founder Dhruba Borthakur, who developed the open source key-value store while at Facebook in order to index social graphs. RocksDB is based on a log-structured merge-tree—a persistent data structure designed to support high-throughput writes and read patterns where recent data is more likely to be accessed. In the Apache Kafka ecosystem, it’s used in ksqlDB and Kafka Streams.
RocksDB tends to run in local, single-node implementations, but Dhruba’s product, Rockset, is a cloud-based, distributed version of RocksDB that seamlessly ingests streaming data from Kafka, indexes it, then allows you to perform real-time analytics on it. On the low level, this episode is a great introduction to log-structured merge trees. On the higher level, it also covers the distinction between traditional real-time analytics geared towards reporting and real-time analytics for powering applications, as we might expect to find in gaming and social media.
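If you want a feel for the idea before listening, here is a toy sketch of a log-structured merge-tree. It is not how RocksDB is implemented: everything stays in memory, there is no write-ahead log or on-disk format, deletes are left out, and the `TinyLSM` class and its `memtable_limit` parameter are names I made up for illustration.

```python
class TinyLSM:
    """Toy log-structured merge-tree (in-memory only, no deletes)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, mutable
        self.segments = []   # immutable sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a key-sorted segment and start a new one.
        if self.memtable:
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Reads check the newest data first, so recent keys are found quickly.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        # Merge all segments into one; newer values overwrite older ones.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [dict(sorted(merged.items()))] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed to a segment
db.put("a", 3)   # newer value for "a" shadows the one in the segment
```

The shape to notice is that writes never modify old segments, and reads walk from newest to oldest, which is why recently written data is the cheapest to look up.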
Visiting the podcast for this episode are Samuel Benz and Patrick Bönzli of SPOUD, a Swiss startup that works in real-time analytics, helping companies navigate and explore data that lives in Kafka. Sam and Patrick share the four types of Kafka adoption they tend to see, with the most intense and ultimate stage characterized by Kafka serving as an entire architecture’s data integration layer (the vaunted “central nervous system”). Clients enjoy the liberty that this arrangement yields, as it allows a large number of new services to easily come online and perform value-creating computations. Sam and Patrick also discuss what the data collection journey usually looks like: sourcing data and then converting it into a usable format tends to be the hard work in the process; the fun comes when you get to start analyzing it.
Kafka possesses an “anarchic-like nature” on some level, as it’s one central point and theoretically anyone in the organization can listen to something that anyone else put into it. So you do have to manage permissions, but it does successfully break the silos that enterprises have been working on for so long.
Salesforce Principal Architect Pat Helland visits Streaming Audio to discuss the ideas from his ACM paper “Space Time Continuum,” which considers how to handle the problem of incomplete data in distributed systems: for example, what to do when data needs to arrive from multiple external locations and may not arrive at all, or may arrive out of event-time order. One solution is to learn how to work effectively with partial results. While explaining his thesis, Pat touches on several other fascinating, tangential topics, including the ways that widely used industry terms such as eventual consistency have fuzzy meanings, and so tend to be both misused and misunderstood. Also, Pat is a huge amount of fun to talk to. I think that shows in the episode.
Testing has kept pace with the growth in sophistication of system architectures, and SmartBear’s tools are at the forefront of the field’s advances. SmartBear Director of Product Management Alianna Inzana visits Streaming Audio to describe the current state of testing at SmartBear, whose ReadyAPI platform provides a synthetic testing experience that employs extensive virtualization—allowing you to more adequately model real life as you develop your systems. Both local and third-party services can be virtualized, and Kafka is well supported, letting you easily mock your data in motion. In the episode, Alianna covers data governance and how it can help your organization grow quickly and in an orderly fashion. She also gives us her predictions of how the future of sharing event data may play out.
Learn about how service virtualization lets you prototype services so that you can gain valuable feedback before committing to fully coding them.
Adithya Chandra joined Confluent’s Kafka performance team to apply the knowledge he gained at AWS working on Aurora and Elasticsearch. The performance team’s objective is to make Confluent Cloud completely invisible to its users, able to handle any workflow they throw at it without skipping a beat. After having struggled with memory issues on Elasticsearch, Adithya was surprised when he arrived at Confluent to find out how little memory Kafka uses, which is partially a result of its keeping heap usage relatively light. Hear the details of this story and others as Adithya addresses Tiered Storage, Kafka’s usage of the page cache, testing with simulated workloads, finding the best cloud hardware, and more.
After listening to the episodes above, you can—if you’re really, really committed—make your way through the entirety of the Streaming Audio catalog over at Confluent Developer to find more. There are many other episodes that I would have liked to add to this list, so I hope you get the chance to check them out.
Tim Berglund is a teacher, author, and former technology leader with Confluent, where he served as the senior director of developer advocacy. He can frequently be found speaking at conferences in the U.S. and all over the world. He is the co-presenter of various O’Reilly training videos on topics ranging from Git to distributed systems, and is the author of Gradle Beyond the Basics. He lives in Littleton, CO, U.S., with the wife of his youth, their three children having grown up.