
No More Swamps: Building a Better-Governed Data Lake Architecture


Two data challenges exist across almost all organizations: access and trust. These issues scale exponentially as an organization grows to the point where it can no longer hand around sheets of paper or manually approve every database access request. The demand for better data access drove the history of data warehousing, following the ethos that better decisions come from more data and that compute would catch up with demand.

However, the hunger for collecting more data didn’t come without a cost. More data collected means more data to manage, which leads to an ever-increasing demand for designing and curating that data with the correct governance. This only intensifies in the age of artificial intelligence (AI). Whereas analytical insights were once consumed from a dashboard by an analyst sitting at the end of a product loop, the demand for this information is now part of a machine-speed cycle: agents crawl out to systems at the center of production, collecting the insights they need to provide better answers.

In this post, we’ll explore some of the most common governance challenges that turn data lakes into data swamps and how they impact businesses, as well as a way to overcome these challenges with shift-left analytics.

What Are the Governance Challenges in Data Lakes?

Data governance can be recognized in many forms, but its fundamental job is to be the blueprint for building a good data asset. It is experienced through technical tooling, processes, policies, and personas, and it looks at data across its entire life cycle.

If we investigate the governance challenges faced over the last couple of decades—since the advent of Hadoop and the emergence of data lake and lakehouse technologies—we see weaknesses emerging in a few core pillars, which led to the infamous title of “data swamp.”

Common Data Lake Challenges: Data Quality, Error Handling, and Data Discovery

Managing Data Quality in Data Lakes

The first stage of the challenge is data quality. One driver of moving toward data lakes was to liberate data from operational domains and open up discoverability for the analytical estate—but this decentralization evolved into data quality issues when the management of the data went unenforced.

When there is a disconnect between the producer of the data and those exploring it for insights, inconsistencies and expired data decrease trust. As with any supply chain, a lack of trust also creates a compliance risk: How do we know if any personally identifiable information (PII) has made it into the lake? How do we know that we’re looking at the real, current world state if we have no service level agreement?

Error Handling in Data Lakes

Data quality issues become even clearer when we explore the next challenge: error handling. There are a number of reasons, from difficult type conversions to table and schema evolution to file access challenges, that can lead to fatal and non-fatal errors. With the volume and variety of data that flows into a data lake, errors should be treated as a matter of when, not if.

Error handling involves not only having systems in place to prevent errors but also having ways to detect them when they occur and roll back to a stable state before downstream systems are polluted. In the most advanced systems, this eventually leads to demands for proactive precautions such as quality rules, which attempt to fail fast and up front rather than letting bad data flow downstream.
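As an illustration of the capture-and-contain idea, here is a minimal sketch of the dead-letter-queue pattern using the confluent-kafka Python client. The topic names, broker address, and quality rule are hypothetical placeholders rather than a prescribed implementation.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "quality-gate",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["orders"])                # hypothetical source topic

def is_valid(event: dict) -> bool:
    # Example quality rule: amount must be present and non-negative.
    return isinstance(event.get("amount"), (int, float)) and event["amount"] >= 0

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        try:
            event = json.loads(msg.value())
            if not is_valid(event):
                raise ValueError("failed quality rule")
            # ... hand the clean record to downstream processing here ...
        except (ValueError, json.JSONDecodeError) as exc:
            # Route the bad record to a dead-letter topic instead of letting it
            # pollute downstream tables; keep the error reason in a header.
            producer.produce(
                "orders.dlq",                 # hypothetical dead-letter topic
                value=msg.value(),
                headers={"error": str(exc)},
            )
        producer.poll(0)
finally:
    consumer.close()
    producer.flush()
```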

Data Discoverability

Finally, there is data discoverability. Even if we’ve been able to provide high-quality data and eliminate errors, all of this will be for nothing if our analysts are unable to discover the data. And if the data is inconsistent or contains errors, there must be provenance to resolve the issues. This pertains not just to the obvious path of working with metadata in a catalog but also to the configuration of the underlying storage so that compute engines can efficiently find files when a query is run.
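To make discoverability concrete, here is a minimal sketch of browsing an Iceberg catalog with PyIceberg before writing any query. The catalog name and REST endpoint are hypothetical placeholders, and a real deployment would also need credentials.

```python
from pyiceberg.catalog import load_catalog

# Connect to a REST-compatible Iceberg catalog (placeholder endpoint).
catalog = load_catalog(
    "analytics",                                   # hypothetical catalog name
    **{"uri": "https://example.com/iceberg/rest"},
)

# Browse what exists before writing a query: namespaces, then tables.
for namespace in catalog.list_namespaces():
    print("namespace:", namespace)
    for table_id in catalog.list_tables(namespace):
        print("  table:", table_id)
```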

These issues are not trivial. Poor data quality can have severe repercussions across an organization. It directly leads to:

  • Inaccurate analytics, which in turn result in flawed business decisions

  • Increased operational costs due to the need for manual data cleanup

  • Potential exposure to costly regulatory fines

  • Eroded trust in the data, discouraging its widespread adoption and limiting its potential business impact

You can’t build a skyscraper without laying strong foundations; likewise, governance is the framework on which a data ecosystem is built. With a proactive approach, it’s possible to reduce the long-term costs, complexity, and risk involved with data management, ensuring that data is fit for purpose from the moment it arrives in the lakehouse.

Shifting Left With Tableflow to Unblock Governance Challenges

When addressing governance challenges, we see the pattern that thinking earlier about the quality and controls on our data can improve our analytics processes. This is the core tenet of the shift-left paradigm, which prioritizes collecting data as quickly as possible and also making it valuable and trustworthy as quickly as possible.

That paradigm elevates the importance of metadata such as schemas to the same level as the data itself, with metadata becoming the cornerstone of building trust in the data. It equally encourages us to evolve our systems and processes to enable a shift closer to the source and to not rely on complex point-to-point data pipelines that alienate both sides of the digital ecosystem from one another.

How the shift-left approach to data pipelines moves processing and governance closer to the source

Confluent Tableflow is specifically engineered to bridge the historical and often complex divide between operational and analytical data estates and to allow this shift left. The operational systems are typically optimized for fast, high-throughput transactions and real-time user interactions, while analytical systems are geared toward extensive post-event analysis and reporting.

Tableflow was built with this duality in mind and enables representation of Apache Kafka® topics and their associated schemas as open table formats, such as Apache Iceberg™ and Delta Lake tables, providing a direct feed into your warehouse, lake, or analytics engines with reduced complexity. It automates critical tasks like data preprocessing, preparation, and transfer, all while keeping the catalog fresh and up to date.

When used in conjunction with the wider data streaming platform (DSP) features in Confluent Cloud, Tableflow provides an even stronger framework for data quality and structural consistency through the shift-left approach. If controls are in place to capture schemas as soon as possible and to transform data before it enters the analytical estate, and the data is then presented in an analytics-friendly way via Tableflow, the burden and cost of downstream data quality checks and transformations are significantly reduced. This data is not only significantly more trustworthy and usable but also more traceable.
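To show the kind of shift-left transformation meant here, below is a small sketch written with the PyFlink Table API purely for illustration; on Confluent Cloud the equivalent logic would typically be expressed as Flink SQL. The table definitions, the datagen stand-in source, and the quality rule are assumptions for the example.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Stand-in source; in practice this would be the Kafka topic backing the stream.
t_env.execute_sql("""
    CREATE TABLE raw_orders (
        order_id BIGINT,
        amount   DOUBLE,
        email    STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Stand-in sink; in practice this would be the cleansed topic Tableflow materializes.
t_env.execute_sql("""
    CREATE TABLE clean_orders (
        order_id BIGINT,
        amount   DOUBLE
    ) WITH ('connector' = 'print')
""")

# Enforce a quality rule and drop the PII column (email) before the data
# ever reaches the analytical estate.
t_env.execute_sql("""
    INSERT INTO clean_orders
    SELECT order_id, amount
    FROM raw_orders
    WHERE amount IS NOT NULL AND amount >= 0
""").wait()
```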

Reinforcing Data Trust Starts With Schematization

Tableflow leverages Schema Registry as the authoritative source for defining table schemas, ensuring structured and consistent data representation within the materialized tables. This provides built-in protections for schema evolution and, if those rules have issues, the option to suspend the materialization flow.
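For instance, registering a subject’s schema and pinning its compatibility mode might look like the following sketch, which uses the confluent-kafka Schema Registry client; the registry URL, subject name, and Avro schema are hypothetical placeholders.

```python
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

# Placeholder registry endpoint; a real setup would also supply credentials.
client = SchemaRegistryClient({"url": "https://psrc-example.confluent.cloud"})

order_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount",   "type": "double"}
      ]
    }
    """,
    schema_type="AVRO",
)

# Register the schema for the topic's value subject and restrict evolution to
# backward-compatible changes so downstream tables can evolve without breaking.
schema_id = client.register_schema("orders-value", order_schema)
client.set_compatibility(subject_name="orders-value", level="BACKWARD")
print("registered schema id:", schema_id)
```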

Consumers of tables materialized by Tableflow can access them only in read-only mode, ensuring that any changes originate from the data source and adhere to predefined evolution rules. This is particularly important, as one of the draws of open table formats is the ability to query evolved data without having to rebuild tables—so keeping track of the schema evolution over time can be especially useful.
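On the consumption side, a reader simply loads and scans the materialized table and never alters its definition. The sketch below uses PyIceberg against an assumed REST-compatible catalog, with a hypothetical endpoint and table identifier.

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{"uri": "https://example.com/iceberg/rest"},  # placeholder endpoint
)

# Hypothetical table materialized from a Kafka topic.
table = catalog.load_table("streaming.orders")

# The current schema already reflects any upstream, rule-checked evolution;
# consumers only read, they never change the table definition.
print(table.schema())

# Query the latest data without rebuilding anything downstream.
print(table.scan().to_arrow())
```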

How Tableflow allows users to query evolved data from Kafka topics without having to rebuild analytics tables

Catalog Management With Tableflow

As mentioned earlier, the flow of metadata is as important as the flow of data to provide a consistently trustworthy platform for analytics. Although Iceberg has a robust metadata layer, this still requires tracking, so we use data catalogs to ensure consistent and coordinated access and to make sure that compute engines know which metadata to read for a table.

To keep the metadata up to date and the catalog synced, the traditional approach is to publish data through a batch or streaming job, such as with Apache Spark™: raw data is written to the bucket in microbatches while the manifest files that the catalog tracks are updated. This can lead to a challenge: Data is delivered to the analytics estate quickly and could even be worked on, but if the Iceberg-related batch jobs have not been run, the catalog itself might be out of sync. Additionally, some services run a crawler to discover metadata, fundamentally working on a “pull” method for catalog synchronization.
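For context, the job-driven path described above often looks something like this PySpark sketch, in which the Iceberg table and catalog state only advance when the scheduled job actually runs. The catalog configuration, bucket paths, and table names are hypothetical placeholders, the Iceberg Spark runtime is assumed to be on the classpath, and the target table is assumed to already exist.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("batch-iceberg-publish")
    # Hypothetical Iceberg catalog backed by object storage.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Read the latest raw drop, apply light cleansing, then append the microbatch.
# Iceberg manifests and the catalog only move forward on this job's schedule.
raw = spark.read.json("s3://example-bucket/raw/orders/")
cleaned = raw.select("order_id", "amount").where("amount >= 0")
cleaned.writeTo("lake.db.orders").append()  # table assumed to exist already
```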

In comparison, the direct catalog synchronization of Tableflow and its ability to propagate out to multiple catalogs simultaneously minimizes the mean time to analytics by removing the need to run scheduled batch jobs for file/table management (e.g., compaction, metadata management, garbage collection). It also seamlessly updates each catalog’s view of the available data for users to explore as quickly as possible, all reading from a common data source.

How Tableflow integrates with analytics catalogs such as AWS Glue, Polaris, and Databricks Unity

This becomes particularly critical for maintaining consistency across multiple catalogs. When there are nuanced differences between formats and conventions that the various catalogs support, having a single system tested to write to all of them provides additional consistency even when using multiple compute engines with different catalog integrations. It even accounts for variety across different open table formats. 

Start Building a Better-Governed Data Lakehouse

The evidence is clear: In the modern data ecosystem, there is an imperative for reliable, consistent, and easily discoverable data. Data is not an exhaust of activity but an asset in itself, which means we need to treat it as such. Both of these observations will only accelerate as agentic AI increases system demands for data to be available at machine speed and to produce trustworthy outputs from non-deterministic systems. Prioritizing governance early should therefore be a top consideration for all organizations. Technology like Tableflow is an important part of the equation for better delivery of well-governed data. But it must also be paired with the right organizational mindset, so that people instinctively think about data assets and their associated governance, building the practices and strategies that start spinning the flywheel of better governance.

The holistic considerations for building a better-governed data lake stretch beyond just implementing tooling to accelerate the mean time to analytics. Based on experiences with businesses across the globe and across industries, the following have become clear:

  • Start with a strong governance framework. Federated architectures require clear data governance policies, defined roles and responsibilities (e.g., data stewards, data governance team), and measurable metrics for success. As with all projects, appointed executive sponsorship will cement the adoption of the measures and metrics.

  • Embrace shift-left data quality. Actively use tools and processes, such as Confluent Cloud for Apache Flink® integrated with Tableflow, to cleanse, transform, and validate data as close to the source as possible. This proactive approach ensures that higher-quality data enters the lakehouse, reducing downstream remediation efforts.

  • Implement a robust data catalog. Integrate your open format tables with a comprehensive data catalog solution (e.g., AWS Glue Data Catalog, Snowflake Polaris Catalog, Unity Catalog) and utilize their features (e.g., lineage, annotations). This enables efficient data discovery and further tracking of data lineage and also facilitates effective metadata management across the entire data estate.

  • Automate where possible. Automate data quality checks, schema enforcement, and table maintenance operations, such as file compaction and cleanup; see the sketch after this list for the kind of maintenance jobs this covers. Automation reduces manual effort, minimizes human error, and improves consistency across the data platform. These tasks can become bottlenecks as platform adoption matures, so, when possible, use tooling that handles them for you to help you scale.

  • Foster a data-driven culture. Promote data literacy, encourage collaboration across different teams and business units, and ensure that data is widely valued as a key organizational asset. Controls and technologies should become enablers for, not barriers to, innovation and efficiency.
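To make the automation point concrete, here is a sketch of the Iceberg table-maintenance procedures that otherwise have to be scheduled by hand, invoked through Spark SQL procedures. The catalog configuration and table name are hypothetical placeholders, and the Iceberg Spark runtime is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-maintenance")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Compact the small files produced by frequent writes.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.orders')")

# Expire old snapshots and remove files no snapshot references any longer.
spark.sql("CALL lake.system.expire_snapshots(table => 'db.orders')")
spark.sql("CALL lake.system.remove_orphan_files(table => 'db.orders')")
```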

Architecture for a Better-Governed Data Lake With Tableflow and the Confluent Cloud DSP

These practices align closely with what we talk about when designing a Data Streaming Organization, and they are particularly fruitful when applied across the entire data stack in your organization. Exploring the pillars and sections of the Data Streaming Readiness framework ties directly back to building in the governance needed to get value out of your data products.

Learn more about how to implement data governance in your organization in our white paper Maximize Data Value and Innovation Across the Organization.


Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, Apache Iceberg™️, Iceberg™️, Apache Spark™️, Spark, and the Kafka, Flink, Iceberg, and Spark logos are either registered trademarks or trademarks of the Apache Software Foundation. No endorsement by the Apache Software Foundation is implied by the use of these marks.

  • Alex Stuart is a Senior Solutions Engineer at Confluent, partnering with strategic organizations across the U.K., South Africa, and beyond to help them realize their data streaming visions. Just like the data he works with, Alex is always “in motion” as a running community leader and globetrotter, having visited 55 countries.
