The worlds of data integration and data pipelines are changing in ways that are highly reminiscent of the profound changes I witnessed in application and service development over the last twenty years. The changes in both cases are not purely technical or architectural. The industry learned that microservices offered a better way of building service-oriented architectures (SOA), of doing services right, not simply because of some radical shift in API specifications or protocols (though I wrote a book on that), but because microservices embraced a new set of development and delivery practices—a DevOps mindset, iterative and incremental development, automation everywhere, continuous delivery, and so on. Now the same disciplined and developer-oriented practices that helped organizations integrate themselves using services are helping them integrate using their data. This is a step-change for the integration world, and in this blog post I show why.
As companies become software, data has become increasingly critical to their success. Once a mostly back office concern—behind the scenes, as it were—data is now in the products and services companies offer their customers. Innovation and world-class execution depend on an organization's ability to discover, unlock, and apply its data assets.
Data pipelines perform much of the undifferentiated heavy lifting in any data-intensive application, moving and transforming data for data integration, analytics, and machine learning purposes. They coordinate the movement and transformation of data, from one or more sources to one or more targets—adapting protocols, formats, and schemas; and cleansing, normalizing, and enriching data in preparation for its application in the target systems. Pipelines are typically used to integrate data from multiple sources in order to create a single view of a business-meaningful entity (combining CRM data and purchasing histories, for example, to create a single view of a customer); transform and prepare data for subsequent analysis (extracting data from a homegrown trading system or a SaaS source like Salesforce and analyzing it in Tableau, for example); perform feature extraction on raw data prior to training a machine learning model (reducing credit card transaction data containing known instances of fraud to numerical features that can be used to develop a fraud detection model, for example); or score a dataset with a pre-trained model (using the previously trained fraud detection model to assess new transactions). While not in themselves sources of value, pipelines nonetheless provide an essential service on behalf of the applications through which an organization generates value from its data.
And yet despite their being critical to the data value stream, pipelines haven't always served us well. Notoriously fragile, built on an ad hoc basis to satisfy specific needs, with little thought to reuse or composability, different pipelines in the same organization often contain redundant calculations and varying interpretations of derived values, which together engender a lack of trust in their outputs. In many organizations, it's not clear which pipelines are in use and which have been abandoned—or even who owns them. For many companies it is difficult, if not impossible, to proactively identify wrong outputs, or trace errors and the causes of data loss back to their origin.
The last few years have seen a big shift in the ways companies think about, organize around, and build solutions that unlock their data. As more and more companies move large parts of their data estates to the cloud, a dizzying number of cloud-based data platform products and services—collectively, the modern data stack—have come to market to accelerate and improve their data management capabilities.
Our focus in this post is on the “pipeline problem.” From out of the welter of recent innovation and experimentation, we’ve chosen several trends that indicate how the industry is attempting to improve the pipeline experience. We look at the problems or needs these trends are seeking to address, and from this analysis we derive five principles for building better pipelines, which we call Streaming Data Pipelines.
The first trend is the one with the broadest organizational and architectural impact: data as a product. One of the four cornerstone principles of Data Mesh, data as a product overturns the widespread notion that the sharing of data between systems must invariably accommodate the internal idiosyncrasies and operational priorities of each system, positing instead that externalized data should purposely be designed to be shared and applied in ways that prioritize the needs of the consumers of that data. To treat data as a product is to fundamentally rethink an organization’s relationship to its data.
Our interest here is in the way data as a product marks a shift to decentralization and federation so as to better facilitate sharing and reuse. Concerns and responsibilities are reassigned from a domain-agnostic central data function back to the teams who best understand the data. From a data owner’s perspective, data exchange is no longer a matter of one or more external agents extracting raw data from a source system in an ad hoc manner, but a purposeful sharing of data in ways that satisfy the expectations and needs of consumers—whether end users or other systems. A data owner’s responsibilities don’t end with allowing access to the underlying data, but extend to maintaining a healthy, accessible, easily-consumed product throughout the data lifecycle.
But this shift can only succeed if accompanied by other, more low-level changes in the ways we build pipelines. Pipeline development has long been a purely technical concern, bottlenecked by a centralized data team’s overstretched integration efforts, but we’re now starting to see pipeline builders adopt the same disciplined practices that have driven modernization in other parts of the software industry. To deliver value quickly and repeatedly to customers, we need deep customer engagement, fast feedback loops, and a form of development that allows for continuous evolution. In the software development space, a set of agile and DevOps practices has emerged over the last twenty years to help deliver software faster and with better quality: these range from unit testing and test-driven development, through continuous integration and continuous delivery, to continuous observability and infrastructure as code. Today we’re seeing some of these same practices being carried over into the data management space, starting with the low-level software development techniques for creating expressive, well-factored, testable, and tested system units.
Which brings us to the second trend that interests us…
Declarative transformation models allow data transformation logic to be packaged into discrete, expressive, testable, and composable units. For too long, pipeline transformation logic has been entangled with flow-of-control logic, tied to integration tooling and its runtime environment, and hidden behind a UI. Faced with the impossible task of factoring out meaningful transformations from imperative code artifacts and workflow definitions that include UI presentation logic—never mind testing and versioning them—data teams have resorted to creating new pipelines for each workload. This, of course, contributes to the fragility of the pipeline estate and a decline in data quality. Changes to upstream data sources break multiple pipelines; varying implementations of common metrics and business logic generate divergent and untrustworthy results. Because even a small change risks propagating breakages across the estate, engineers are fearful of changing existing pipelines, preferring instead to add ‘just one more’ to address each new workload.
dbt is a declarative transformation model framework that allows data teams to create models using SQL SELECT statements. These models describe what a transformation should do. They are then executed inside a target—a data warehouse, for example—to create and populate tables. Each model encapsulates a single SELECT statement, but models can be composed from other models using reference functions. This allows a developer to build complex models from simple building blocks without having to repeatedly inline or copy-and-paste dependent SQL statements: an aggregate advertising revenue model, for example, can be composed from separate Google Ad and Facebook Ad models. If a model changes—perhaps because an upstream data source has changed—that change will be propagated throughout the model dependency graph: no more accidentally divergent pipeline logic.
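To make this concrete, here’s a rough sketch of what such a composed model might look like in dbt (the model, table, and column names are illustrative, not taken from a real project):

```sql
-- models/ad_revenue.sql  (illustrative model name)
-- dbt resolves each ref() to the upstream model's table or view,
-- and uses these references to build the model dependency graph.
select
    campaign_date,
    sum(revenue) as total_ad_revenue
from (
    select campaign_date, revenue from {{ ref('google_ads') }}
    union all
    select campaign_date, revenue from {{ ref('facebook_ads') }}
) as combined
group by campaign_date
```

If the google_ads or facebook_ads model changes, this aggregate picks up the change the next time the project is run; nothing is copied and pasted.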
At first glance, this doesn’t appear to be anything revolutionary: after all, data teams have been using SQL inside their pipelines for decades. But there are several things that mark this out as being a step change in data engineering.
First, tools such as dbt place the SELECT statement at the heart of model building. SELECT FROM is an extremely concise and expressive way of describing what the data you want should look like, and where it should come from: its declarative nature allows you to focus on describing your data needs, without having to work out (and optimize) the sequence of operations necessary to assemble the data to fulfill those needs.
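As a minimal illustration (with hypothetical table and column names), the single view of a customer mentioned earlier can be expressed in one declarative statement, leaving the engine to work out how to assemble it:

```sql
-- What the data should look like: one row per customer, combining CRM
-- attributes with purchasing history. How it is assembled is left to the engine.
select
    c.customer_id,
    c.email,
    count(o.order_id)  as lifetime_orders,
    sum(o.order_total) as lifetime_spend
from crm_customers c
left join purchase_history o
    on o.customer_id = c.customer_id
group by c.customer_id, c.email;
```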
Second, a model framework’s modular nature allows you to break complex transformations into smaller, simpler parts, and to reuse the same building blocks in different contexts—a common software development practice. This kind of modularity is missing from the copy-and-paste blocks of monolithic SQL and flow-of-control logic typical of traditional pipelines, where transformation logic is repeated, sometimes with ad hoc variations, from pipeline to pipeline, and where the logic itself can quite easily diverge from the target schema to which it is to be applied.
Third, a good declarative model framework allows for constraints and tests to be written for each model, and asserted both during development and at runtime. This is the equivalent of unit testing for data. Not only does it provide a bedrock of asserted behavior that allows data teams to confidently make controlled changes to their models and pipelines, it also opens up interesting possibilities from a product perspective. A product’s development is driven by the expectations and needs of consumers: tests provide a concise, expressive, and executable vehicle for consumers to communicate these requirements to the data product’s provider.
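In dbt, for example, one way to express such an assertion is as a singular test: a SQL query that returns the rows violating an expectation, with the test failing if any rows come back. A sketch, continuing the illustrative advertising example:

```sql
-- tests/assert_ad_revenue_not_negative.sql  (illustrative test name)
-- The query selects rows that violate the expectation;
-- the test fails if it returns any rows at all.
select *
from {{ ref('ad_revenue') }}
where total_ad_revenue < 0
```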
dbt isn’t the only declarative data model framework to have emerged in the last few years, but its unvarnished use of a highly commoditized query language, SQL, and developer-oriented software development lifecycle affordances have ensured its widespread adoption. Elsewhere we see Looker’s Malloy, an “experimental language for describing data relationships and transformations,” and RelationalAI’s Rel, which is described as an “expressive, declarative, and relational language designed for modeling domain knowledge.” Of the two, Rel is the more mature, with a rigorous syntax that allows subject matter experts and data practitioners to build executable specifications that express complex relationships in their data.
The move to align data pipeline tooling with modern software delivery practices is perhaps nowhere more evident than in the emergence of a new breed of developer-friendly visual pipeline builders. Traditional data platforms are very much closed environments, quite separate from the rest of the software development toolchain. Mostly UI-based, with little or no API support, such platforms rarely produce intelligible code artifacts: at best, they emit either a binary or a proprietary file format closely tied to the operation of their UI.
All this has changed. Platform development tools such as Prophecy, Upsolver, and Quix provide browser-based development environments that generate intelligible code artifacts. These tools combine drag-and-drop canvas builders with inline code and query editors, allowing non-technical and technical users to collaborate. While the development experience is still primarily UI-based, the big difference is in the outputs: open formats such as Python or SQL for transformation logic and YAML for describing the pipeline topology.
Besides supporting open, and often declarative, pipeline representation formats, these new tools also play nicely with the rest of the modern software development toolchain via integrations with version control and continuous integration and continuous delivery systems. Pipeline builders can now browse catalogs of data sources and reusable pipeline assets; propose, review, and version changes; test and debug pipeline components in an iterative and incremental fashion; and promote approved pipelines from test through to production environments. We call this developer-friendly or developer-oriented tooling, not because it appeals solely to developers, but because it brings the benefits of the modern software development toolchain and its associated practices to everyone involved in building pipelines.
The trends that we’ve looked at so far have tended to emphasize decentralization—putting the ownership of a product, model, or metric back in the hands of the teams best placed to understand the jobs a data capability must perform, and allowing for capabilities to be shared so that they can be reused in multiple contexts. In contrast, the next trend, ELT (extract, load, and then transform), encourages centralization; more specifically, the execution of all transformation logic in the data warehouse or data lake.
The modern data stack is often described as an unbundling of the traditional data platform: a disaggregation into discrete capabilities—transformation, orchestration, discovery, governance, and so on—that can then be re-assembled to deliver a better, more comprehensive data experience than the monolithic data platforms of old. Paradoxically, this splintering of features into separate tools has also helped cement the importance of two stalwarts of the data landscape: the data warehouse and data lake. Irrespective of the particular blend of tools used in an organization, all data extracted from operational systems is invariably directed towards one or other of these destinations. ELT, because it depends on the data warehouse or data lake for its transformation capabilities, further consolidates the central importance of these two technologies.
ELT defers the transformation stage of the pipeline process until after the data has been loaded into the destination (whether a data lake or warehouse), leveraging the increased scalability and performance of modern, cloud-based data warehouse and data lake technologies to apply transformations to data in situ. This allows for data to be extracted in a speculative fashion from source systems and then loaded in its raw form into the target, without a data team first having to consider the use cases or workloads that might apply it. In this respect, ELT supports a more agile data pipeline process, allowing teams to build and evolve well-formed models on top of the raw data as and when they learn more about the needs of downstream consumers—a welcome step towards building a fast-moving, responsive, data-driven organization.
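A sketch of what this looks like in practice, with illustrative schema and column names: raw Salesforce account records are landed untouched by the extract-and-load tool, and only later, once consumer needs are understood, is a cleaned, well-modeled view defined on top of them inside the warehouse:

```sql
-- The raw.salesforce_accounts table is loaded as-is; the transformation
-- lives entirely in the warehouse and can evolve as requirements emerge.
create or replace view analytics.dim_account as
select
    id                              as account_id,
    trim(name)                      as account_name,
    cast(annual_revenue as numeric) as annual_revenue,
    cast(created_date as timestamp) as created_at
from raw.salesforce_accounts
where is_deleted = false;
```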
But the approach also has its downsides. With no business-meaningful goals motivating the ingestion of data, ownership and responsibility are diluted. Multiple overlapping tables of raw data, with no clear purpose or lineage, sit side-by-side with well-formed, purposefully modeled schemas: the boundaries between schemas blur, accidental dependencies between poorly defined tables accrue, and without clear documentation and governance, data quality and productivity suffer.
There's one further downside here. ELT, though it has the benefit of allowing data to be extracted from source systems without knowing ahead of time how it will be accessed, is usually a batch-oriented process. Complex pipelines that derive refined data products from raw extracts can form large dependency graphs: when many processes are chained together in this way, changes can take a long time to propagate from source systems to the consumable products. Delays at any stage will cause the remainder of the pipeline to stall. Long latencies, stale data, and opaque lineage inhibit an organization's ability to actually get value from derived data products.
If ELT further consolidates the role of the data warehouse or data lake as the ultimate destination for all enterprise data, Reverse ETL, another recent trend, can be seen as a symptom of this centralizing force. Reverse ETL allows data that has been integrated in the data lake or warehouse, and/or the results of subsequent analyses of that data, to be shared back with operational systems. Reverse ETL exists precisely because the data warehouse is the site of all data integration. Operational systems in any reasonably complex estate invariably need to use data belonging to other systems in order to fulfill their application goals; hence they must source this shared data from, or have it delivered by, a process attached to the data warehouse. Reverse ETL surfaces a genuine and important need, to share data and analyses with operational systems, but there must be simpler ways of satisfying this need, surely? We’ll return to this question shortly.
So what have we seen? New tools and practices—often inspired by the revolutions in software development and delivery practices of the last two decades—signal a tidal change in data management. But there are counter currents also, which reinforce old habits or undermine the success of some of the new approaches.
On the one hand, we have a move to decentralized architectures, which assign clear ownership of data capabilities to the teams best placed to address the needs of consumers. By dividing and delegating responsibilities to the teams closest to the data, these architectures better enable the sharing and reuse of trusted data, thereby reducing the proliferation of duplicated, ill-qualified, and ad hoc datasets. We have tools that facilitate a separation of concerns, and which support composing, testing, and versioning transformation logic via the modern developer toolchain. These tools enable data teams to adopt the same agile and DevOps developer practices that engineers use to successfully deliver software elsewhere in the organization. And we have the use of declarative languages such as SQL to describe where data comes from, where it goes, and what it looks like as it undergoes transformation. Declarative languages allow data teams to spend more time productively focused on realizing good data outcomes than on writing fragile sequences of instructions to coordinate the movement and transformation of data.
But on the other hand, we see a tendency to overload the data warehouse or data lake with speculative exports of data with no discernible business value, and to treat the warehouse or lake as being solely responsible for data integration and data sharing: practices that make it difficult to know who owns what, or which data to trust, and which complicate the integration of operational systems with lots of extra plumbing.
There’s something missing from the picture.
That something is stream processing. As more and more companies collect streaming data to proactively engage with customers and manage real-time changes in risk and market conditions, the need for streaming data platforms and stream processing capabilities has grown. Whether it’s augmenting established batch-oriented data processing platforms with the ability to consume from or publish to streams, and process streaming events using micro-batch-based tools, as with AWS Glue’s streaming ETL jobs, or new entrants such as Meroxa, Decodable, and DataCater—data processing platforms built from the ground up to handle streaming data—the modern data stack is expanding rapidly to accommodate the increased demand for real-time data pipelines.
Stream processing is a foundational capability for modern data processing. It aligns well with the good practices outlined above, and also provides alternatives to the needs addressed by ELT and reverse ETL: to progressively adapt and refine data as new requirements emerge, and to share data and the results of analysis with operational systems.
Streams provide a durable, high-fidelity repository of business facts—of all the work that has taken place in an organization. Acting as low-latency publish-subscribe conduits, they allow systems to act on events as they happen in the real world: no more stale results or work built on out-of-date datasets derived from periodic snapshots. Importantly, streams create time-versioned data by default, enabling consumers to reconstruct state from any period in the past, and time-travel to reprocess data and apply new or revised calculations to prior tracts of history. The high-fidelity aspect of the stream is critical in this respect: while data warehouse and data lake datasets will often contain timestamped facts and slowly changing dimensions that reflect historical changes to state, the fidelity of the record—the degree to which every change that has occurred outside the dataset is captured and stored in the dataset—is dependent on application-specific design and implementation choices; streams, in contrast, retain every event, irrespective of the workload or use case.
Considered in the context of data as a product, streams offer a powerful mechanism for data owners to share well-modeled data with clients and consumers without first having to send it to a data warehouse or data lake. Producers publish events whenever meaningful changes occur in the systems for which they are responsible. Data is published once, but can be consumed many times, by different subscribers, for both operational and analytical purposes. This approach supports both decentralization and reuse, making the owners of source systems responsible for creating affordances that allow their data to be reused in multiple contexts.
Subscribing applications act on the events they receive to update internal state or trigger additional behaviors. In the simplest scenarios, streams provide for direct data integration, with source systems using Change Data Capture (CDC) mechanisms to publish changes that can then be used to update operational databases (MySQL or PostgreSQL, for example), search and observability applications (such as Elasticsearch), and analytics targets such as data warehouses and data lakes.
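As a sketch of the simple case, a source connector can be declared with a few lines of SQL in a tool such as ksqlDB; the configuration below is abbreviated and illustrative (database and connector names are hypothetical, and further required, version-dependent properties are omitted):

```sql
-- Capture row-level changes from a hypothetical Postgres orders database
-- and publish them to Kafka topics for downstream consumers.
-- Configuration is abbreviated: topic naming, secrets handling, and other
-- required Debezium properties are omitted here.
CREATE SOURCE CONNECTOR orders_cdc WITH (
  'connector.class'    = 'io.debezium.connector.postgresql.PostgresConnector',
  'database.hostname'  = 'orders-db',
  'database.port'      = '5432',
  'database.user'      = 'replicator',
  'database.dbname'    = 'orders',
  'table.include.list' = 'public.orders'
);
```

The resulting change streams can then be delivered to targets such as Elasticsearch or the data warehouse by equivalent sink connectors.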
More complex solutions use stream processing to consume from one or more source streams, continuously apply calculations and transformations to the data while it is in motion, and publish the results as they arise to a target stream. Like the standing patterns that emerge where a river flows over a weir, stream processing delivers a continuously changing, always up-to-date result as a function of the flow of data. The trend here, as with the declarative transformation models we looked at earlier, is to use SQL: ksqlDB, Materialize, and Snowflake’s stream processing all use SQL to describe where data comes from, where it goes, and what it should look like along its journey. Using these tools, teams can enrich and transform the data while it is in flight, creating derived streams for new products such as metrics, denormalized views over multiple sources, and aggregations.
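Here’s a minimal sketch in ksqlDB SQL, one of the tools just mentioned; the topic, columns, and formats are illustrative:

```sql
-- Register a stream over an existing Kafka topic of order events.
CREATE STREAM orders (
  order_id    VARCHAR KEY,
  customer_id VARCHAR,
  amount      DOUBLE
) WITH (
  KAFKA_TOPIC  = 'orders',
  VALUE_FORMAT = 'JSON'
);

-- A continuously maintained, derived result: revenue per customer,
-- updated as each new order event flows through.
CREATE TABLE revenue_per_customer AS
  SELECT customer_id,
         SUM(amount) AS total_revenue
  FROM orders
  GROUP BY customer_id
  EMIT CHANGES;
```

The derived table is itself backed by a stream of changes, so further queries, and further products, can be layered on top of it in exactly the same way.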
What emerges is a network of real-time data. If you feel your data capabilities today are underperforming, if you’re frustrated working with multiple siloed pipelines directed towards a centralized data warehouse or data lake, imagine instead a streaming, decentralized, and declarative data flow network that lets the right people do the right work, at the right time, and which fosters sharing, collaboration, evolution, and reuse.
This network chains transformation logic together to act on streaming data immediately, as it flows through the system. With this kind of data flow, there’s no need to wait for one operation to finish before the next begins: consumers start getting results from the stream product they’re interested in within moments of the first records entering the network. New use cases tap existing streams and introduce new streams, thereby extending the network.
This doesn’t eliminate the data warehouse or data lake; rather, it redistributes traditional pipeline responsibilities in line with some of the practices outlined above so as to provide clear ownership of timely, replayable, composable, and reusable pipeline capabilities that create value as the data flows—whether for low-latency integrations for operational systems, or analytical or machine learning workloads.
Of course, nothing comes for free. Just as a microservices architecture can only succeed if you adopt a new set of development and delivery practices—a DevOps mindset, iterative and incremental development, automation everywhere, continuous delivery, and so on—so the evolution of a data flow network calls for a disciplined, developer-oriented approach, with governance a first-class concern. If decentralization helps increase the reusability of streams in the data flow network, governance helps reduce the risk and operational overhead of managing the kinds of complex, distributed environments that decentralization brings. Platform-level governance ensures the network is protected, can be run effectively by decentralized domain-oriented teams, and promotes collaboration and trust throughout the organization.
We call this overall approach to building better pipelines Streaming Data Pipelines. Streaming Data Pipelines can be characterized with five principles, derived from the trends we’ve witnessed emerging over the last few years: pipelines should be streaming, decentralized, declarative, developer-oriented, and governed.
Pipelines are the essential plumbing without which a data-driven organization can’t function. These five principles are important because they encapsulate the end goals against which pipeline developers should measure their solutions. Building streaming data pipelines means building pipelines in ways that scale to address not only the data requirements across an organization, but also the communication, accessibility, delivery, and operational needs of the entire data ecosystem.
Ready to delve deeper into Streaming Data Pipelines? Check out the other resources available to help you along the way.