
Streaming Data Product Lifecycle Management

Abstract: Traditional data management approaches, with their heavy centralization and batch-based nature, are struggling to keep up with the demands of today's organizations. That's why many companies are turning to new paradigms like data mesh and event-driven architecture (EDA) to address their data management needs. This post dives into how these modern approaches can be used to manage data platforms and how to take advantage of them. Specifically, we'll show you how to formally define streaming data products using descriptor files and how these files can be used to automate the entire lifecycle of a data product, from deployment to retirement. We'll use the open Data Product Descriptor Specification (DPDS) for defining descriptor files and the Open Data Mesh (ODM) Platform as an example to demonstrate how it all works in practice. So, if you're looking to improve your data management strategy, read on!

Houston, we have a problem

Traditionally, analytical data platforms have been characterized by a highly centralized technological and organizational architecture and by predominantly batch-integration modes. These two aspects, following the constant growth of the number of data producers and consumers, have generated several problems, including difficulty in evolving, increased management complexity, and a reduction in agility.

It is therefore not surprising that many organizations see data and its management as a strategic element, but also one of the main barriers to innovation. In particular, it is precisely the constant growth in the complexity of integrations and the resulting costs that make the current paradigm unsustainable. In this sense, there are two interesting trends that organizations face: data mesh and event-driven architectures (EDA).

Data mesh and EDA to the rescue

Data mesh is a holistic approach that emphasizes the importance of both organizational and technological factors in building decentralized data management systems. This approach prioritizes decentralization of data ownership and management of data as products, with a focus on establishing a federated governance system to define global policies and ensure interoperability. Additionally, a self-serve platform is crucial for enabling cross-functional teams to access and utilize data in a decentralized manner.

Event-driven architectures (EDA), on the other hand, are based on the exchange of events on a common communication medium that decouples producers and consumers without the need of a centrally predetermined workflow: this allows applications to be managed independently. At the infrastructure level, streaming platforms enable the reliable, secure, and scalable collection and distribution of events in real time between different applications. This architectural pattern is widely used in the operational world and is also spreading into the analytical world, thereby promoting the unification of the two integration stacks in an architecture generally referred to as a digital integration hub (DIH). In this architecture, all data enters the integration platform in real time as events (kappa architecture) and is then transformed and rematerialized in different storage systems for different workloads.

Data mesh and event-driven architecture are not alternative choices. Many organizations are implementing a combination of the two approaches to achieve optimal results. Regardless of the specific approach taken, the idea of managing data as a product and automating as much as possible the data management activities through a self-serve platform are common pillars in all modern data strategies. Before delving into a hands-on example to see them in action, let's briefly introduce these two key concepts.

Data products

Managing data as a product means applying to the development of integration logic (i.e., collection, transformation, and distribution) the same principles and operating models that are used in the development of traditional applications. Three basic elements define a data product:

  • Ownership: Each data product must have a clear owner, responsible for its design, development, and maintenance.

  • Scope: Each data product must have a clear scope that unambiguously defines what is part of it and what is not. A data product is composed of data, metadata, application code, and infrastructural resources—everything that is required to manage it as a single independent unit of deployment.

  • Interfaces: The public interfaces of a data product include input and output ports that manage incoming and outgoing data. Interfaces also include other useful ports for operational management, such as control ports, discovery ports, and observability ports.

To ensure interoperability between the various data products, it is necessary to centrally define shared specifications and protocols. This is a responsibility of the federated governance team, while the product teams are responsible for complying with these standards.

Self-serve platform

To facilitate the development of data products it is useful to centralize some common services through a shared self-serve platform. This platform enforces the shared standards, makes it easy to create and manage data products, and reduces both the TCO and manual management activities. Its services can be organized into these three groups:

  • Utility plane: Provides services with access to the underlying infrastructural resources. The utility plane decouples its consumers (i.e., the data product experience plane) from the actual applications and resources provided by the underlying infrastructure.

  • Data product experience plane: Provides operations to manage the lifecycle of data products—creation, update, versioning, validation, deployment, search, and decommissioning.

  • Mesh experience plane: Provides a marketplace where data products can be searched, accessed, and queried.

Confluent Cloud provides all the basic elements necessary to develop (i.e., Connectors, ksqlDB, Stream Designer), validate (i.e., Schema Registry), deploy (i.e., Terraform provider), and operate (i.e., Stream Governance) a data product that manages real-time data throughout its entire lifecycle. A simple example of how these services can be used in practice to create a streaming data product is shown in this blog post on how to build data mesh with event streams and in the associated e-book.

The services exposed by Confluent Cloud are already self-serving and declarative in nature. Therefore, although not much work is needed at the utility plane level, it is still important to implement services at the data product experience plane level. This allows for easy and atomic orchestration of the underlying utility plane's services in key phases of the lifecycle of a data product such as creation and deployment.

Track and trace data product

With an example inspired by the logistics world, let’s see how to handle a stream application not as an ad hoc integration flow, but as a real data product. The example is an extremely simplified version of a shipment track and trace application that focuses on monitoring the vehicles that move goods. Each vehicle engaged in a trip periodically notifies its status through events, as shown in the following table:

| tripId | tripStatus | position | timestamp |
|---|---|---|---|
| 1 | planned | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:00:00 |
| 1 | position-notified | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:15:00 |
| 1 | loading-started | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:30:00 |
| 1 | loading-ended | (lon: 9.189982, lat: 45.46427) | 2022-01-01 13:00:00 |
| 1 | departed-from-origin | (lon: 9.189982, lat: 45.46427) | 2022-01-01 13:15:00 |
| 1 | position-notified | (lon: 9.229982, lat: 45.42427) | 2022-01-01 14:15:00 |
| 1 | position-notified | (lon: 9.289982, lat: 45.40427) | 2022-01-01 16:15:00 |
| 1 | arrived-at-destination | (lon: 14.258824, lat: 40.84844) | 2022-01-02 02:00:00 |
| 1 | unloading-started | (lon: 14.258824, lat: 40.84844) | 2022-01-02 02:30:00 |
| 1 | unloading-ended | (lon: 14.258824, lat: 40.84844) | 2022-01-02 03:00:00 |
| 1 | completed | (lon: 14.258824, lat: 40.84844) | 2022-01-02 03:30:00 |

Some events represent an actual change in the status of the trip. Others are just an update of the vehicle's position. The track and trace application has the task of reading the events coming from the field and providing its consumers with two distinct interfaces: one to read the updated status of a trip (i.e., TripStatus) and one to access the historical record of the vehicle's positions (i.e., TripPosition).

TripStatus

| tripId | tripStatus | timestamp |
|---|---|---|
| 4 | planned | 2022-01-01 12:00:00 |
| 3 | completed | 2022-01-01 12:05:00 |
| 1 | completed | 2022-01-02 03:30:00 |
| 4 | loading-ended | 2022-01-01 12:15:00 |
| 5 | planned | 2022-01-01 12:20:00 |
| 6 | completed | 2022-01-01 12:25:00 |

TripPosition

| tripId | position | timestamp |
|---|---|---|
| 1 | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:00:00 |
| 1 | (lon: 9.189982, lat: 45.46427) | 2022-01-01 12:15:00 |
| 1 | (lon: 9.229982, lat: 45.42427) | 2022-01-01 14:15:00 |
| 1 | (lon: 9.249982, lat: 45.41427) | 2022-01-01 15:15:00 |
| 1 | (lon: 9.289982, lat: 45.40427) | 2022-01-01 16:15:00 |
| 1 | (lon: 9.319982, lat: 45.39427) | 2022-01-01 17:15:00 |
| 1 | (lon: 9.349982, lat: 45.38427) | 2022-01-01 18:15:00 |
| 1 | (lon: 9.389982, lat: 45.37427) | 2022-01-01 19:15:00 |
| 1 | (lon: 14.258824, lat: 40.84844) | 2022-01-02 02:00:00 |
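To make the split between the two interfaces concrete, here is a minimal Python sketch of the routing logic. It is not the ksqlDB implementation used in the demo. It assumes that position-notified is the only event type that leaves the trip status unchanged, and that the position carried by every event is appended to the history; the actual queries may filter differently. The names TripEvent and apply_event are hypothetical.

```python
from dataclasses import dataclass

# Assumption: this is the only event type that does not change the trip status.
POSITION_ONLY = "position-notified"


@dataclass(frozen=True)
class TripEvent:
    trip_id: int
    trip_status: str
    position: tuple[float, float]  # (lon, lat)
    timestamp: str


def apply_event(event, trip_status, trip_position):
    """Update both views in place from a single incoming event."""
    if event.trip_status != POSITION_ONLY:
        # Table semantics: keep only the latest status for each trip.
        trip_status[event.trip_id] = (event.trip_status, event.timestamp)
    # Stream semantics: append every reported position (assumption, see above).
    trip_position.append((event.trip_id, event.position, event.timestamp))


events = [
    TripEvent(1, "planned", (9.189982, 45.46427), "2022-01-01 12:00:00"),
    TripEvent(1, "position-notified", (9.189982, 45.46427), "2022-01-01 12:15:00"),
    TripEvent(1, "loading-started", (9.189982, 45.46427), "2022-01-01 12:30:00"),
]

status = {}     # TripStatus view: trip_id -> (status, timestamp)
positions = []  # TripPosition view: list of (trip_id, position, timestamp)
for event in events:
    apply_event(event, status, positions)
```

The status view behaves like a table keyed by tripId, while the position view behaves like an append-only stream, which is the distinction the two output interfaces expose.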

Meeting these requirements is easy with Confluent Cloud by using the following components:

  • Topics on Confluent Cloud Clusters, to store events

  • Schemas on Confluent Schema Registry, for governance purposes and to enable the usage of ksqlDB queries

  • ksqlDB, to transform input data into the desired outputs through a set of ksqlDB instructions

However, implementation is only one part of the solution—the application must also provide a real data product.

Data product life cycle management

There are three fundamental ingredients to make an integration application a true data product:

  1. A contract that formally describes external interfaces and what data is provided.

  2. A self-serve platform capable of maintaining these contracts, enforcing them, and automating the data product lifecycle.

  3. A person who owns the contract, provides support for its consumers, and uses the platform to manage the data product. 

The contract is foundational since it guarantees a shared and clear method to describe all the elements of the data product. 

First, it is necessary to define a shared specification to describe the contract exposed by a data product. While it is possible to manage the lifecycle of a data product manually, it remains error prone and inefficient. Instead, you should build a platform to automate and manage the data product’s lifecycle, including registration, validation, build, and deploy.

The platform should utilize the information within the contract (i.e., data product descriptor) to register and validate the data product, and then deploy and manage its running instance (i.e., data product container), as depicted in the image below.

In the next sections you will see:

  • How to define the contract of the track and trace application using the open Data Product Descriptor Specification (DPDS).

  • The high-level architecture of a platform for managing the lifecycle of a data product. For reference, we will use the platform we use internally in our projects, the ODM Platform, which will be open sourced in the first half of 2023.

DPDS

The Data Product Descriptor Specification is an open specification that declaratively defines a data product and all its components using a JSON or YAML descriptor document. It is released under the Apache 2.0 license and managed by the Open Data Mesh Initiative.

DPDS is technology agnostic and allows you to leverage your previous investments by specifying all the components used to build the data product, from the infrastructure up to its interfaces. 

Let’s take a look at the DPDS components for our track and trace example application: 

  • An input port that receives notification events of a change in the trip status 

  • Two output ports, one that continuously reports the current snapshot of each trip status in real time and one that tracks all the positions recorded for the vehicle doing the trip

  • An internal application (i.e., ksqlDB) that transforms the incoming events read from the input port into the data exposed to consumers through the output ports

  • Three infrastructural components (i.e., topics), one to store input data and two to store output data

Next, let’s take a look at how these elements relate to the descriptor (the full descriptor definition is available here).

The info field is useful for other teams to understand what the data product is for and who owns it:

"info": {
        "name": "trackandtrace",
        "fullyQualifiedName": "urn:dpds:com.company-xyz:dataproducts:trackandtrace",
        "version": "1.0.0",
        "domain": "Transport Management",
        "owner": {
            "id": "john.doe@company-xyz.com",
            "name": "John Doe"
        }
    }
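As an illustration of how another team or a platform service might consume this block, the following Python sketch reads the info field and builds a one-line summary. The set of fields treated as required and the helper name summarize_info are assumptions made for this example; they are not taken from the DPDS specification.

```python
import json

# Assumption: the fields this sketch treats as mandatory. DPDS may define a different set.
REQUIRED_INFO_FIELDS = ("name", "fullyQualifiedName", "version", "owner")

DESCRIPTOR_TEXT = """
{
    "info": {
        "name": "trackandtrace",
        "fullyQualifiedName": "urn:dpds:com.company-xyz:dataproducts:trackandtrace",
        "version": "1.0.0",
        "domain": "Transport Management",
        "owner": {"id": "john.doe@company-xyz.com", "name": "John Doe"}
    }
}
"""


def summarize_info(descriptor):
    """Return a human-readable summary of the descriptor's info block."""
    info = descriptor["info"]
    missing = [field for field in REQUIRED_INFO_FIELDS if field not in info]
    if missing:
        raise ValueError("info block is missing: " + ", ".join(missing))
    owner = info["owner"]
    domain = info.get("domain", "unknown domain")
    return f"{info['name']} v{info['version']} ({domain}), owned by {owner['name']} <{owner['id']}>"


summary = summarize_info(json.loads(DESCRIPTOR_TEXT))
```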

The interfaceComponents field describes the inputPorts and outputPorts, along with descriptions and code references for each:

"interfaceComponents": {
   "inputPorts": [
      {
       "description": "Through this port trip data is read",
       "$ref": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/ports/tripEvent-iport.json"
      }
    ],
    "outputPorts": [
      {
       "description": "This port exposes the current snapshot of each trip in real-time",
       "$ref": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/ports/tripCurrentSnapshot-oport.json"
      },
      {
       "description": "This port tracks all the positions recorded for the vector associated to each trip in real-time",
       "$ref": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/ports/tripRouteHistory-oport.json"
      }
    ]
}

All interface components, regardless of the type of port, describe the agreements between the product developer and its consumers in terms of promises (what is guaranteed by the product) and expectations (how the product should be used). Among the promises, particularly important are the APIs through which data is exposed and the SLOs provided by the data product owner.
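Because each port is declared through a $ref to an external file, a tool that reads the descriptor has to expand those references before it can inspect promises and expectations. The sketch below shows one possible way to do that in Python. The merge rule (keys written next to the $ref override keys of the referenced document) is an assumption of this example, not a rule stated by DPDS. The loader is injected so the example can run against an in-memory stand-in with hypothetical content instead of fetching the real files.

```python
def resolve_refs(node, load):
    """Return a copy of node in which every object containing "$ref" is expanded.

    load is any callable that maps a reference string to the parsed document.
    Note: this sketch performs no cycle detection.
    """
    if isinstance(node, list):
        return [resolve_refs(item, load) for item in node]
    if not isinstance(node, dict):
        return node
    local = {key: resolve_refs(value, load) for key, value in node.items() if key != "$ref"}
    if "$ref" not in node:
        return local
    target = resolve_refs(load(node["$ref"]), load)
    # Assumption: keys declared next to the $ref win over keys of the referenced file.
    return {**target, **local}


# In-memory stand-in for the referenced files (hypothetical, simplified content).
FILES = {
    "ports/tripEvent-iport.json": {"name": "tripEvent", "version": "1.0.0"},
}

interface_components = {
    "inputPorts": [
        {
            "description": "Through this port trip data is read",
            "$ref": "ports/tripEvent-iport.json",
        }
    ]
}

resolved = resolve_refs(interface_components, FILES.__getitem__)
```

In a real setting the loader would download and parse the referenced URL; keeping it as a parameter makes the resolution logic easy to test in isolation.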

Both the input and output ports of this use case rely on AsyncAPI v2.5.0 since it provides a comprehensive definition for streaming data interfaces. Let’s go deeper with the tripCurrentSnapshot output port to see how it is defined:

{
    "fullyQualifiedName": "urn:dpds:com.company-xyz:dataproducts:trackandtrace:outputports:tripCurrentSnapshot",
    "entityType": "outputport",
    "name": "tripCurrentSnapshot",
    "version": "1.0.0",
    "displayName": "Trip Current Snapshot",
    "description": "Topic that tracks the last known information about each trip",
    "promises": {
        "platform": "aws:centraleurope",
        "serviceType": "streaming-services",
        "api": {
            "specification": "asyncapi",
            "version": "2.5.0",
            "definition": {
                "mediaType": "text/json",
                "$href": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/ports/tripCurrentSnapshot-oport-api.json"
            },
            "externalDocs": {
                "description": "The AsyncAPI v2.5.0 specification used to define the API of this port",
                "mediaType": "text/html",
                "$href": "https://www.asyncapi.com/docs/reference/specification/v2.5.0"
            }
        }
     }
}

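As you can see from the descriptor, the API of the output port is defined in a separate AsyncAPI document. In it we specified that the output port is a topic that generates a stream of events, and we provided the server information that tells consumers where the topic can be found and how to connect to it: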

   "asyncapi": "2.5.0",
    "info": {
        "title": "Current snapshot for each trip",
        "version": "1.0.0",
        "description": "This API exposes the current snapshot for each trip as events"
    },
    "servers": {
        "test": {
            "connectionString": "ABC.confluent.cloud:9092",
            "description": "Confluent Cloud bootstrap servers",
            "protocol": "kafka",
            "protocolVersion": "latest",
            "bindings": {
                "kafka": {
                    "schemaRegistryUrl": "https://ABC.confluent.cloud",
                    "schemaRegistryVendor": "confluent"
                }
            }
        }
    }

Furthermore, we defined the format of each message by using an Avro schema:

"message": {
    "messageId": "trips-value",
    "contentType": "avro/binary",
    "schemaFormat": "application/vnd.apache.avro",
    "payload": {
        "$ref": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/ports/tripCurrentSnapshot-oport-schema.avsc"
    }
}

Finally, the internal components are further divided into two groups: infrastructural components and application components. While the infrastructural components are the muscles responsible for providing all the resources required by the data product, the application is the brain that delivers the results to the output ports. Below is part of the descriptor that defines both these components:

"internalComponents": {
    "applicationComponents": [
    {
      "description": "The app that processes trip data and generates data related to trip route history and trip current snapshot in real-time",
      "$ref": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/apps/processingapp.json"
    }
    ],
    "infrastructuralComponents": [
    {
      "description": "Confluent Cloud platform is used to store all the events, their schemas and that let the app do some processing",
      "$ref": "https://raw.githubusercontent.com/Quantyca/odm-demo-trips/main/infra/confluentcloud.json"
    }
    ]
 }

Let’s take a closer look at the infrastructural component definition:

{
    "fullyQualifiedName": "urn:dpds:com.company-xyz:dataproducts:trackandtrace:infrastructure:eventStore",
    "version": "1.0.1",
    "description": "The Kafka topics topology required to store events acquired from external systems and new events generated by the application",
    "platform": "centraleurope.aws:confluent",
    "infrastructureType":"storage-resource",
    "provisionInfo": {
        "service": "terraform",
        "template": {
            "mediaType": "text/terraformfile",
            "$href": "https://github.com/Quantyca/odm-demo-trips-terraform.git"
       },
        "configurations": {
            "TF_VAR_env_id": "ABC",
            "TF_VAR_cluster_id": "ABC",
            "TF_VAR_rest_endpoint": "https://ABC.confluent.cloud:443",
            "TF_VAR_confluent_cloud_api_key": "ABC",
            "TF_VAR_confluent_cloud_api_secret": "ABC",
            "TF_VAR_kafka_api_key": "ABC",
            "TF_VAR_kafka_api_secret": "ABC",
            "TF_VAR_schema_registry_url": "https://ABC.confluent.cloud",
            "TF_VAR_schema_registry_key": "ABC",
            "TF_VAR_schema_registry_secret": "ABC"
        }
    }
}

It uses Terraform (see the provisionInfo section) to provision the required topics on Confluent Cloud. Other provisioners can be used depending on your needs (e.g., CloudFormation). To make this possible, the provisionInfo section specifies the provisioner service to use, the template to pass to the service with information on what needs to be provisioned, and optionally a set of configurations to inject into the template at provision time. The same strategy is used to keep the build and deployment information of the application components independent from the specific CI/CD service used. In this case the ksqlDB application is deployed using a custom script contained in the template file of the deployInfo block.
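The indirection described above can be sketched in Python as a small registry that maps the service name found in provisionInfo to an adapter. This is an illustration under stated assumptions, not the ODM Platform's code: the register decorator, the adapter signature, and the placeholder Terraform adapter (which only returns a description of what it would do) are all hypothetical.

```python
from typing import Callable

# An adapter receives the template location and the configurations to inject.
Provisioner = Callable[[str, dict], str]

PROVISIONERS: dict[str, Provisioner] = {}


def register(service: str):
    """Register an adapter under the service name used in provisionInfo."""
    def decorator(adapter: Provisioner) -> Provisioner:
        PROVISIONERS[service] = adapter
        return adapter
    return decorator


@register("terraform")
def terraform_adapter(template_href: str, configurations: dict) -> str:
    # Placeholder: a real adapter would fetch the template and run Terraform.
    return f"terraform: apply {template_href} with {len(configurations)} variables"


def provision(component: dict) -> str:
    """Dispatch an infrastructural component to the adapter named in its descriptor."""
    info = component["provisionInfo"]
    service = info["service"]
    if service not in PROVISIONERS:
        raise ValueError(f"no adapter registered for provisioner {service!r}")
    return PROVISIONERS[service](info["template"]["$href"], info.get("configurations", {}))


component = {
    "provisionInfo": {
        "service": "terraform",
        "template": {
            "mediaType": "text/terraformfile",
            "$href": "https://github.com/Quantyca/odm-demo-trips-terraform.git",
        },
        "configurations": {"TF_VAR_env_id": "ABC", "TF_VAR_cluster_id": "ABC"},
    }
}

result = provision(component)
```

Supporting another provisioner then means registering one more adapter; the descriptor format and the dispatch code stay the same.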

Open Data Mesh (ODM) Platform

The ODM Platform is the solution we use internally as a self-serve platform to manage the lifecycle of data products from deployment to retirement. 

It’s designed to support and facilitate the transition toward the data mesh paradigm, providing data teams with automation services and high-quality standards. The platform will be open sourced in the first quarter of 2023 on the Open Data Mesh Initiative GitHub page.

The ODM Platform is made up of two primary planes: the utility plane and the product plane. Let’s take a look at the utility plane first, and then follow up with the product plane. 

The utility plane exposes the main APIs of the underlying infrastructure and services. It is an abstraction layer that decouples the product plane services from the underlying infrastructure, allowing it to evolve independently with minimal impact on already implemented products. 

The services exposed by the utility plane offer high-level capabilities to product teams for their self-service data product needs. Some common examples of services exposed by the utility plane include:

  • Policy Service, which creates, manages, and enforces global governance policies

  • Meta Service, which propagates the metadata associated with the data product to an external target metadata management system, such as Confluent Schema Registry, Collibra, or Blindata

  • Provision Service, which creates and manages the infrastructure

  • Build Service, which compiles and packages all the applications

  • Integration Service, which deploys the application artifacts

The following image illustrates how the product plane interacts with the APIs of the underlying infrastructure through the abstraction provided by the utility plane.

In our use case the Provision Service uses a Terraform adapter that, thanks to the wide range of providers supported, enables the management of the infrastructure for almost any kind of product. The Meta Service uses a Confluent Schema Registry adapter to record the schemas of events used by the data products, while the Integration Service uses a custom script adapter to publish ksqlDB queries on Confluent ksqlDB. In another context, we could use the same platform with a completely different set of adapters, such as CloudFormation, Blindata, and Jenkins. The ODM Platform’s architecture is very flexible, allowing you to integrate with the technologies of your choice.

The data product experience plane is built on top of the utility plane. It exposes the services necessary to manage the lifecycle of a data product. Among the basic services that a typical product plane must provide are the Registry Service and the Deployment Service.

The Registry Service allows for publishing a new version of the data product descriptor and making it available on demand to consumers for data discovery and access. To do this, it orchestrates the services offered by the utility plane in the following way:

  1. Validates the syntactic correctness of the descriptor.

  2. Verifies that the new version is backward compatible with previous ones.

  3. Verifies the descriptor's compliance with globally defined governance policies using the Policy Service (e.g., APIs that expose streaming services must be described using AsyncAPI, schemas must be in Avro and compliant with the CloudEvents specification).

  4. Saves the metadata contained in the descriptor in an external governance system using the Meta Service.

Below is an image of the detailed workflow.
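The four registration steps can be sketched as a simple pipeline. The Python below is a simplified illustration, not the ODM Platform implementation. It assumes that ports are already inlined in the descriptor with a name field, that backward compatibility means no previously published output port disappears, and that the only policy is the AsyncAPI rule mentioned above. A plain list stands in for the Meta Service, and all function names are hypothetical.

```python
import json


class RegistrationError(Exception):
    """Raised when a descriptor fails one of the registration checks."""


def output_ports(descriptor):
    return descriptor.get("interfaceComponents", {}).get("outputPorts", [])


def parse_descriptor(text):
    # Step 1: syntactic validation (here: the document must be valid JSON).
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise RegistrationError(f"descriptor is not valid JSON: {exc}") from exc


def check_backward_compatible(new, previous):
    # Step 2 (assumed rule): every output port of the previous version must still exist.
    removed = {p["name"] for p in output_ports(previous)} - {p["name"] for p in output_ports(new)}
    if removed:
        raise RegistrationError(f"output ports removed: {sorted(removed)}")


def check_policies(descriptor):
    # Step 3: streaming ports must describe their API using AsyncAPI.
    for port in output_ports(descriptor):
        promises = port.get("promises", {})
        uses_asyncapi = promises.get("api", {}).get("specification") == "asyncapi"
        if promises.get("serviceType") == "streaming-services" and not uses_asyncapi:
            raise RegistrationError(f"port {port['name']!r} must be described using AsyncAPI")


def register(text, previous, meta_store):
    descriptor = parse_descriptor(text)
    if previous is not None:
        check_backward_compatible(descriptor, previous)
    check_policies(descriptor)
    meta_store.append(descriptor)  # Step 4: stand-in for the Meta Service call.
    return descriptor


port = {
    "name": "tripCurrentSnapshot",
    "promises": {
        "serviceType": "streaming-services",
        "api": {"specification": "asyncapi", "version": "2.5.0"},
    },
}
v1_text = json.dumps({"info": {"version": "1.0.0"}, "interfaceComponents": {"outputPorts": [port]}})

meta_store = []
v1 = register(v1_text, None, meta_store)
```

Metadata is saved only after all checks pass, so a rejected descriptor leaves the external governance system untouched.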

The Deployment Service is responsible for managing the creation and release of the data product container, which is created using the descriptor. To do this, it orchestrates the services offered by the utility plane in the following way:

  1. Parses the application code and, if compilation is needed, generates the executable artifact (e.g., jar, war, exe) using the Build Service.

  2. Creates the infrastructure using the Provision Service.

  3. Records all metadata in the designated external metadata repository using the Meta Service.

  4. Deploys applications using the Integration Service.

Note that every time there is a significant change to the status of the data product, the Policy Service must be called to verify compliance with global policies. Below is an image of the detailed workflow.
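The orchestration performed by the Deployment Service can be sketched as follows. This Python example is illustrative only: it assumes that each of the four steps counts as a significant status change and therefore triggers a policy check, and it represents the utility plane services as plain callables. The names deploy, DEPLOYMENT_STEPS, and PolicyViolation are hypothetical.

```python
class PolicyViolation(Exception):
    """Raised when the policy check fails after a deployment step."""


# Order taken from the numbered steps above.
DEPLOYMENT_STEPS = ("build", "provision", "meta", "integration")


def deploy(descriptor, services, is_compliant):
    """Run each utility plane service in order, checking policies after every step.

    services maps a step name to a callable that takes the descriptor.
    is_compliant stands in for the Policy Service and receives the descriptor
    and the name of the step that just completed.
    """
    completed = []
    for step in DEPLOYMENT_STEPS:
        services[step](descriptor)
        if not is_compliant(descriptor, step):
            raise PolicyViolation(f"policy check failed after step {step!r}")
        completed.append(step)
    return completed


# Stand-in services that only record that they were called.
calls = []
services = {step: (lambda descriptor, step=step: calls.append(step)) for step in DEPLOYMENT_STEPS}

completed = deploy({"info": {"name": "trackandtrace"}}, services, lambda descriptor, step: True)
```

Stopping at the first failed policy check keeps later steps, such as deploying the application, from running against a non-compliant data product.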

Conclusion

Data management is still a major challenge for many organizations. To address this challenge, many companies are turning to modern, decentralized approaches that focus on real-time data. Data mesh and event-driven architectures are two popular options to consider. One key aspect of these approaches is to treat data as a product, which is crucial for their successful adoption. This post shows how to formally define a data product in a streaming context using the DPDS specification, and how this definition can be used to design a platform that can automate its lifecycle as much as possible. For more information about DPDS and the ODM Platform you can check the Open Data Mesh Initiative website and join the community.

  • As the CTO at Quantyca, a premier Italian consulting firm in data management, and co-founder of blindata.io, a state-of-the-art SaaS data governance platform, I bring over 15 years of expertise in the dynamic world of data. I've seen it all, from BI and DWH projects to navigating the challenges of big data, AI, and cloud. My passion for technology and data has only flourished along the way. The practice of data engineering has been a bottleneck in the IT industry, but the ongoing revolution in data management, driven by the growing centrality of data in all areas of life, is changing this. I'm eager to contribute to this revolution and excited to see what the future holds for the world of data.

  • I’ve been passionate about data since I was a computer engineering student and since then, for the last 10 years, I've had the pleasure to work as a data engineer at a consulting company that specializes in data and metadata management. I have collaborated on many projects across different domains, experimented with lots of technologies, and realized data architectures for challenging scenarios. I have personally experienced all relevant changes witnessed by data management and I aim to always be on track with new technologies and approaches to be able to develop modern solutions.
