Decentralized architectures continue to flourish as engineering teams look to unlock the potential of their people and systems. From Git, to microservices, to cryptocurrencies, these designs look to decentralization as a method of breaking apart centralized bottlenecks. Data mesh is an approach to data and organizational management centered around decentralizing control of data itself. In this post, we’ll look at a Confluent Developer video-led course that tackles the big concepts and walks you through creating your own data mesh using event streams and Confluent Cloud.
A properly implemented data mesh can bring rigor to your company’s data practices, introducing the means to access and use important data across your organization. It enables you to scale your data architecture both technologically and organizationally, eliminating ad hoc point-to-point connections in your data pipelines. Data mesh brings selected business data to the forefront, exposing it as a first-class citizen for systems and processes to couple on directly.
Microservices, data fabric, data marts, event streaming, and domain-driven design influence the data mesh. These influences can be summed up by data mesh’s four principles—data ownership by domain, data as a product, self-service data platform, and federated governance.
In the course’s second module, you can kick off the creation of your own data mesh prototype. First, the instructions walk you through setting up a Confluent Cloud account, which will be the foundation of the prototype. After you clone a GitHub repository with the prototype code, you run a script that provisions the necessary cloud resources, including an Apache Kafka® cluster and a ksqlDB application. The data mesh prototype uses the Confluent Cloud stream catalog for metadata storage and search. In the later modules of the course, you’ll walk through the prototype features, including exploring and creating data products.
In a data mesh, each data asset is curated by the domain team that is most familiar with it and thus most likely to provide a high grade of stewardship. This model is quite different from the data warehouse antipattern, where a single, generalist team manages all of the data for the whole organization and is often more focused on the technical details of the data warehouse than the quality of the data itself. Organizations that implement a data mesh need to clearly define which domain team owns which data set, and all teams must be willing to make changes quickly so that their data in the mesh is always of good quality.
Teams working in a data mesh selectively publish their data for the benefit of the other teams—their internal customers. Data becomes a first-class citizen, complete with dedicated owners responsible for its quality, uptime, discoverability, and usability, with the same level of rigor that one would apply to a business service. The data products published by the respective teams working in a mesh are similar in many ways to microservices, but data is on offer—not compute.
Although the data mesh ideal is based around the decentralized management of data, one of its chief requirements is a centralized location where all members of an organization can find the data sets they need. Both real-time and historical data should be made available (preferably stored in a Kappa architecture), and there should be an automated way to access the data. A plug-and-play tool to fulfill this principle doesn’t currently exist, but it could be accomplished with a UI, an API, or even a wiki. Data product management, including adding, updating, and removing data products, is another important facet of self-service. The barriers to entry and management should be as low as possible to facilitate usage.
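To make the self-service idea concrete, here is a minimal Python sketch of a central registry that supports adding, updating, removing, and searching data products. It is an in-memory stand-in, not the Confluent Cloud stream catalog, and the `DataProduct` fields, the `Catalog` class, and the sample entries are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor for one published data product."""
    name: str
    domain: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """In-memory stand-in for a central, searchable data product registry."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        # Adding and updating share one code path: the latest entry wins.
        self._products[product.name] = product

    def remove(self, name):
        # Removing an unknown product is a no-op, keeping management simple.
        self._products.pop(name, None)

    def search(self, term):
        # Case-insensitive match on name, description, or tags.
        term = term.lower()
        return sorted(
            p.name
            for p in self._products.values()
            if term in p.name.lower()
            or term in p.description.lower()
            or any(term in t.lower() for t in p.tags)
        )


catalog = Catalog()
catalog.publish(DataProduct("orders", "sales", "sales-team",
                            "Completed customer orders", {"pii"}))
catalog.publish(DataProduct("shipments", "logistics", "logistics-team",
                            "Shipment status events for orders"))
print(catalog.search("orders"))
```

A real implementation would sit behind a UI or an API and persist its entries, but the surface area stays about this small: publish, remove, and search.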
Applying federated governance in your data mesh ensures that teams will always be able to use the data that is available to them from other domains. Global standards should be created and applied across the mesh, and they can take the form of data contracts, schemas, and schema registries, as well as consistent error detection and recovery practices. Strategies like logging, data profiling, and data lineage can help you detect errors in the mesh. Be pragmatic and don’t expect your governance system to be perfect: It can be challenging to strike the right balance between centralization and decentralization.
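One way to picture a data contract is as a shared definition that every record in a data product is checked against before it is published. The Python sketch below assumes a deliberately simple contract format (field name mapped to a required type). `ORDER_CONTRACT` and `contract_violations` are hypothetical names, and a production mesh would more likely rely on a schema registry with Avro, Protobuf, or JSON Schema.

```python
# Hypothetical contract: field name -> required Python type.
ORDER_CONTRACT = {"order_id": str, "amount": float, "currency": str}


def contract_violations(record, contract):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in contract.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


good = {"order_id": "A-1", "amount": 19.99, "currency": "USD"}
bad = {"order_id": 42, "currency": "USD"}

print(contract_violations(good, ORDER_CONTRACT))
print(contract_violations(bad, ORDER_CONTRACT))
```

Returning the list of violations, rather than raising on the first one, makes it easier to feed the results into logging or data profiling.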
Switching an organization to a data mesh is not something that is immediately achievable—it has to be gradual. Management commitment is an essential first step to buy-in across the business. Once you have that, you can assign a few forward-looking teams the responsibility to produce their data as a product. This paves the way forward for other teams, and provides valuable learnings about how to construct data products, factoring in the needs of your consumers while creating the foundations of federated governance. Alongside the mesh you will likely be implementing some concepts from other systems, such as microservices and domain-driven design (DDD). So while your data mesh may not appear overnight, if you make deliberate changes based on its principles, you will eventually start to recognize its appearance in your organization and will begin reaping its benefits.
The course’s second module covered the process of building the data mesh prototype, including the provisioning of cloud resources to support it. Once built, the prototype runs a web-based application featuring a loose workflow for exploring, creating, and publishing data products. In this module, you use the Explore page of the application to see how the prototype uses the stream catalog to let you browse data products and view their metadata.
In the final module of the course, we look at creating and publishing a new data product. Using the provisioned Confluent Cloud ksqlDB app, execute basic SQL statements to create new event streams. Finally, publish the data products to the stream catalog by providing the necessary metadata attributes, allowing other users of your data mesh to discover and use them.
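As a rough illustration of those two steps, the Python sketch below composes a ksqlDB `CREATE STREAM ... AS SELECT` statement as text and checks that a data product's metadata is complete before publishing. The stream names, the filter condition, and the `REQUIRED_METADATA` attributes are assumptions for illustration and are not taken from the course's prototype.

```python
# Hypothetical set of attributes a data product must carry before publishing.
REQUIRED_METADATA = ("owner", "description", "domain")


def derived_stream_sql(new_stream, source_stream, condition):
    """Compose a ksqlDB CREATE STREAM ... AS SELECT statement as text.

    Illustration only: the arguments are interpolated directly, so pass
    trusted identifiers, never raw user input.
    """
    return (
        f"CREATE STREAM {new_stream} AS "
        f"SELECT * FROM {source_stream} "
        f"WHERE {condition} EMIT CHANGES;"
    )


def missing_metadata(metadata):
    """List the required attributes that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not metadata.get(key)]


statement = derived_stream_sql("high_value_orders", "orders", "amount > 100")
print(statement)

metadata = {"owner": "sales-team", "description": "Orders above 100"}
print(missing_metadata(metadata))
```

You would paste a statement like this into the ksqlDB editor, or submit it through whatever interface your environment provides, and publish only once the metadata check comes back empty.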
If you’re ready to delve even deeper into the topic of data mesh, check out the resources below:
DATAMESH101 for $25 of free usage