I remember the first data streaming application my team and I built. It started out with a single microservice that pushed data into a stream. Of course, the next step was to introduce a second service to consume that data. And once we had two microservices, it wasn’t long before we added a third, a fourth, and so on.
Initially, anyone on the team was able to very quickly sit down at a whiteboard and draw out exactly how data was flowing between the various services. Sometimes we would do this multiple times a week as we discussed new ideas or features.
But over time, we began to encounter problems. As the system grew, we started to lose track of what data was available, or how it flowed through the system. This only got worse as the system evolved and the flows changed. We weren’t always aware of every change that was made, even if it was our own team making it, and soon our knowledge became outdated. Of course, we tried to keep track of this information in a wiki, but as you can imagine, it quickly grew stale.
We wanted everyone in the company to benefit from the streams of data we were building. We saw the incredible advantages that came from using those streams to build new features and we wanted everyone to share in that success.
But there were problems. The other teams lacked the level of insight that we had. They didn’t know what data was available, where to find it, or how to consume it. Discoverability was a real issue. Soon, we felt like we were an isolated team doing our own thing, rather than part of a larger cohesive whole. We wanted to be the central nervous system, but instead, we felt a little like a third arm that everyone tried to ignore.
This is a difficult pattern to break out of. When moving to a data streaming architecture, we often start with a single team working on the problem. Once the systems are in place, expanding them beyond the immediate team and integrating them with other systems becomes tricky. People tend to fear new things and are reluctant to ask for help. They would prefer if everything were documented so they could find what they need on their own. But it doesn’t take long for those documents to grow stale as the application evolves.
When I first encountered these problems, my team and I were working on a custom streaming system that we had built internally. It predated Apache Kafka® and other modern solutions. But the problems we faced then are still present today. In fact, as data streaming has become integral to digital transformations, it has only grown more difficult to ensure that everyone can discover and use the data they need.
Today, solving these problems falls under the umbrella of Stream Governance. The idea behind Stream Governance is to create a system of data streams that are reliable, discoverable, and secure so that we can provide the central nervous system that drives the critical aspects of the business.
But we don’t achieve this with a team of people writing documentation that quickly grows stale. Instead, we achieve it with automated tooling.
This is where Confluent Cloud’s Stream Governance comes in. It provides a set of tools designed to help us overcome the challenges I encountered earlier in my career.
Let’s take a quick look at how it does that.
One of the key challenges that streaming systems face is how to effectively handle schema evolution. We need to ensure that the data we publish remains compatible with downstream consumers. Over time, the data tends to evolve, and if we aren’t careful it is easy to accidentally make a change that will break the flow. This erodes trust in the data and makes people reluctant to rely on it.
This is where Confluent Schema Registry can help. It provides a central authority to manage and maintain data schemas. But it does more than that. Within Schema Registry, we can establish rules on how a schema can evolve to ensure that it remains backward compatible, or even forward compatible if that’s the goal. If we try to update a schema in a way that violates those compatibility rules, the update will be rejected. By applying and enforcing these standards, we ensure our streams maintain a high level of quality. This helps bolster trust and encourages people to rely on the streams.
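To make the idea of a compatibility rule concrete, here is a minimal sketch of a backward-compatibility check over Avro-style record schemas written as plain Python dictionaries. It is not Schema Registry's implementation or API. The function name and the example Order schemas are made up for illustration, and the rule is deliberately simplified: a field added in the new schema must carry a default, and any change to an existing field's type is treated as incompatible (real Avro resolution also allows some type promotions).

```python
from typing import Any, Dict

Schema = Dict[str, Any]


def is_backward_compatible(old_schema: Schema, new_schema: Schema) -> bool:
    """Toy check: can a reader using new_schema read data written with old_schema?"""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            # Old data never contained this field, so the reader needs a default.
            if "default" not in field:
                return False
        elif previous["type"] != field["type"]:
            # Simplification (assumption): any type change counts as incompatible.
            return False
    # Fields dropped from the new schema are simply ignored by the reader.
    return True


# Hypothetical schemas, used only to exercise the check.
ORDER_V1: Schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

ORDER_V2_WITH_DEFAULT: Schema = {
    "type": "record",
    "name": "Order",
    "fields": ORDER_V1["fields"]
    + [{"name": "currency", "type": "string", "default": "USD"}],
}

ORDER_V2_NO_DEFAULT: Schema = {
    "type": "record",
    "name": "Order",
    "fields": ORDER_V1["fields"] + [{"name": "currency", "type": "string"}],
}
```

Under this toy rule, ORDER_V2_WITH_DEFAULT passes the check against ORDER_V1, while ORDER_V2_NO_DEFAULT fails it. That second case is the kind of update a registry enforcing backward compatibility would reject.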
Another challenge we encounter is the need for discoverability. Other teams aren’t going to start consuming our streams if they don’t know what data is available to them. And external documentation about the streams tends to grow stale and loses its value.
Confluent Stream Catalog provides a way to search our existing streams to locate specific data. This allows users to rapidly find what they need, without having to consult with outside experts. It also allows us to tag streams with additional metadata that can further enhance search capabilities. For example, if we wanted to mark certain streams to indicate they contain personally identifiable information (PII), we could leverage the catalog’s tagging features to do just that. This metadata is tied directly to the schemas that our applications are actively using. That makes it difficult for it to grow stale, because a stale schema wouldn’t still be in use.
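The tagging idea can be sketched with a small in-memory stand-in. This is not the Stream Catalog API. The ToyCatalog class, its methods, and the stream names are all hypothetical, and a real catalog attaches tags to schemas and other entities rather than to bare strings.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyCatalog:
    """Hypothetical in-memory catalog mapping stream names to tags."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = defaultdict(set)

    def tag(self, stream: str, label: str) -> None:
        # Attach a label such as "PII" to a stream.
        self._tags[stream].add(label)

    def find_by_tag(self, label: str) -> List[str]:
        # Return every stream carrying the label, in alphabetical order.
        return sorted(
            stream for stream, labels in self._tags.items() if label in labels
        )


# Made-up streams and tags for illustration.
catalog = ToyCatalog()
catalog.tag("customers", "PII")
catalog.tag("customers", "crm")
catalog.tag("payments", "PII")
catalog.tag("page_views", "analytics")
```

Calling catalog.find_by_tag("PII") returns the customers and payments streams, which is the kind of question another team could then answer without tracking down the stream's owners.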
The catalog also makes it easy for unfamiliar developers to find what they need. Using the catalog they can quickly locate streams of data and see exactly what the schema will look like. From there it’s a simple matter to start consuming that data using the schema, especially if they leverage code generation to convert the schema into compilable code.
And as the system grows, continually drawing it on a whiteboard becomes difficult. Meanwhile, diagrams embedded in documents or wiki pages grow stale and can no longer be relied on. Even just navigating between the various components can become a challenge if we don’t understand how those components connect.
Confluent Stream Lineage maps our data streams in near-real time based on the data flowing through the system. It shows us where data originates, where it’s going, and each step along the journey. It’s trivial for us to jump into the Stream Lineage and see the flow of our data, not just right now, but even at points in the past. We can easily visualize how the data flow has evolved over time. And, because this is built from the live data streams, it’s never stale. You can trust that it is showing you the current state of the system, rather than an imperfect human interpretation of what it might have looked like a week ago.
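As a rough illustration of deriving a map from observed flows instead of hand-drawn diagrams, the sketch below builds a directed graph from (source, destination) pairs and walks it to find everything downstream of a node. It is not how Stream Lineage is implemented. The function names, the service names, and the topic names are all invented for the example.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def build_lineage(flows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn observed (source, destination) pairs into an adjacency map."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, destination in flows:
        graph[source].add(destination)
    return graph


def downstream(graph: Dict[str, Set[str]], start: str) -> List[str]:
    """Breadth-first walk listing everything reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen and neighbor != start:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(seen)


# Hypothetical observations: services writing to and reading from topics.
OBSERVED_FLOWS = [
    ("order-service", "orders"),
    ("orders", "billing-service"),
    ("billing-service", "invoices"),
    ("invoices", "email-service"),
    ("orders", "analytics-service"),
]

lineage = build_lineage(OBSERVED_FLOWS)
```

Here downstream(lineage, "orders") lists analytics-service, billing-service, email-service, and invoices. Because the graph is rebuilt from whatever flows are observed, it changes when the system changes, which is the property a static wiki diagram lacks.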
I look at the tools provided by Confluent Cloud, and I can’t help but wonder how things might have been different if I had had access to Stream Governance in my old system. Much of our internal documentation would have been made unnecessary by these tools. It might have been easier to convince other teams to use our streams if we had been able to point to a catalog where they could find what they needed on their own, especially since they could trust that the data would be up to date and would remain compatible as the system evolved. And I have to say, I’m a huge fan of visualizations. I would have loved to sit down in a meeting with other teams and show them a map of how the data is flowing through the system, and how their new systems might fit into that world. That alone might have convinced them it was worth the price of admission.
If you are like me and you are looking for better ways to manage your data streams at scale, then it’s worth taking a look at the Stream Governance features in Confluent Cloud. And if you want to get a head start with some hands-on work, check out the new Governing Data Streams course on Confluent Developer.