I ran into the schema-management problem while working with my second Hadoop customer. Until then, there was one true database, and the database was responsible for managing schemas and pretty much everything else. In comparison, Hadoop and many other NoSQL systems are the Wild West – schemas still exist, but there is no standard format and no enforcement – leaving each developer on their own when it comes to figuring out what data is stored in which table and directory. For the rest of the post I’m assuming the reader is already familiar with the concept of data schemas and why they matter. For those who are not, we recommend reading “How I Learned to Stop Worrying and Love the Schema, part 1” first.
When I started working on Hadoop projects, I found myself working on the single most common Hadoop project: Data Warehouse Offloading. DWH Offloading projects are all very similar and include three main requirements:
Usually, “a bunch of tables” means “around 10,000”, so automation is key. Since some Sqoop connectors can’t load data into Hive automatically, we had to dump data into Avro files and then use another script to load the data into Hive. The schema can be pretty large, so we need to store it in HDFS and just give the URI to Hive. Then a large number of other applications can start processing the data. These applications also need to be aware of the schema we previously stashed away.
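The "give the URI to Hive" step can be sketched as a small generator that emits one Hive DDL statement per offloaded table. This is a minimal sketch, not the scripts used in those projects: the function names, the `/data/dwh` data root and the `hdfs:///metadata/schemas` schema directory are hypothetical, and `STORED AS AVRO` assumes Hive 0.14 or later. The `avro.schema.url` table property is what points Hive at the schema file stored in HDFS.

```python
def hive_avro_ddl(table: str, data_dir: str, schema_uri: str) -> str:
    """Build the DDL for an external, Avro-backed Hive table.

    The schema is not inlined; Hive reads it from the URI given in
    the 'avro.schema.url' table property.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}\n"
        "STORED AS AVRO\n"
        f"LOCATION '{data_dir}'\n"
        f"TBLPROPERTIES ('avro.schema.url'='{schema_uri}');"
    )


def ddl_for_tables(tables,
                   data_root="/data/dwh",                    # hypothetical layout
                   schema_root="hdfs:///metadata/schemas"):  # hypothetical layout
    """Generate one statement per table, following a fixed naming convention."""
    return [
        hive_avro_ddl(t, f"{data_root}/{t}", f"{schema_root}/{t}.avsc")
        for t in tables
    ]


if __name__ == "__main__":
    for statement in ddl_for_tables(["customers", "orders"]):
        print(statement, end="\n\n")
```

With thousands of tables, the value of a fixed convention like this is that any downstream application can derive the schema location from the table name alone.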
Let’s look at the information flow in this use case:
Note how critical it is to save the schemas somewhere safe – the whole process depends on our ability to get the Avro schema and share it with every app that will need to access the data. The shared schema allowed for data discovery and for different teams to use the same data for their applications and analysis. No shared schema means no data sharing.
At first those metadata directories holding the schemas were rather ad-hoc, but soon we realized that this is a common pattern.
There is an important general lesson here. As I wrote earlier: schemas are critical, and a shared repository of all schemas used by your organization is important for making siloed knowledge shared and explicit. Different projects need to collaborate on the same data – and to do that, they all need to know the metadata – which fields are available and what their types are. The core need is for readers to be able to understand and read data written by upstream writers, at all times.
Let’s look at the benefits of schema registry in detail:
Because shared schemas are so critical, I’ve implemented this process with some small variations for at least 10 different customers. It was all doable, but our customers were forced to reinvent the wheel time and again because this critical component was missing from their data platform.
The need to manage schemas became more pronounced with the rise of real-time data and the need for stream processing. In stream processing pipelines, there are no files to act as containers for messages with a single format. Instead, we just see a stream of individual records that can be of any type, and we need an efficient way to determine the schema for each arriving record. You can’t send your schema together with each record since that incurs huge overhead – schemas are often larger than the records themselves. But if you don’t send your schema with every record – how will your stream processing tools of choice be able to process the data? How will you load the results to HBase or Impala?
Every stream processing project I’ve seen was forced to re-invent a solution to the schema management problem, just like my older ETL projects did.
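One shape such a solution often takes is to register each schema once and prefix every record with a small identifier instead of the schema itself. The sketch below illustrates the idea only: `InMemorySchemaRegistry`, `encode`, `decode` and the 5-byte header layout are assumptions made for this example, and the payload is JSON rather than Avro so that the code runs with the standard library alone.

```python
import json
import struct


class InMemorySchemaRegistry:
    """Toy stand-in for a shared registry: maps schemas to small integer ids."""

    def __init__(self):
        self._ids = {}       # canonical schema text -> id
        self._schemas = {}   # id -> schema

    def register(self, schema: dict) -> int:
        key = json.dumps(schema, sort_keys=True)  # canonical form for lookup
        if key not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[key] = schema_id
            self._schemas[schema_id] = schema
        return self._ids[key]

    def get(self, schema_id: int) -> dict:
        return self._schemas[schema_id]


MAGIC = 0
HEADER = struct.Struct(">bI")  # 1-byte marker + 4-byte big-endian schema id


def encode(registry, schema: dict, record: dict) -> bytes:
    """Prefix the serialized record with a fixed-size header, not the schema."""
    schema_id = registry.register(schema)
    payload = json.dumps(record).encode("utf-8")
    return HEADER.pack(MAGIC, schema_id) + payload


def decode(registry, message: bytes):
    """Read the id from the header, then fetch the schema from the registry."""
    magic, schema_id = HEADER.unpack_from(message)
    if magic != MAGIC:
        raise ValueError("unknown message framing")
    record = json.loads(message[HEADER.size:].decode("utf-8"))
    return registry.get(schema_id), record
```

The per-record overhead is a constant five bytes however large the schema is, and a consumer only needs to contact the registry the first time it sees a given id.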
Every project needing to implement some hack sooner or later points to a global need that must be addressed at the platform level. This is why we believe that Schema Registry is a must-have component for any data storage and processing platform, and especially for stream processing platforms.
This need manifests itself in various ways in practice. For example, Hive with Avro requires the schema as a parameter when creating a table. But even if an application doesn’t require the schema, the people who write the application need to know what the fifth field is and how to get the username out of the record. If you don’t give all the developers a good way to learn more about the data, you will need to answer those questions again and again. A good schema registry allows a large number of development teams to work together more efficiently.
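To make the "what is the fifth field" question concrete, here is a minimal sketch that lists the fields of an Avro record schema by position. The `PageView` schema and the `describe_fields` helper are hypothetical; the only assumption about Avro is that a record schema is a JSON document with a `fields` array whose entries carry `name`, `type` and an optional `doc`.

```python
import json

# Hypothetical schema, written the way an Avro .avsc file would be.
PAGE_VIEW_SCHEMA = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "timestamp",  "type": "long"},
    {"name": "url",        "type": "string"},
    {"name": "referrer",   "type": ["null", "string"], "default": null},
    {"name": "session_id", "type": "string"},
    {"name": "username",   "type": "string", "doc": "Login name of the viewer"}
  ]
}
"""


def describe_fields(schema_json: str):
    """Return (position, name, type, doc) for each field of a record schema."""
    schema = json.loads(schema_json)
    rows = []
    for position, field in enumerate(schema["fields"], start=1):
        field_type = field["type"]
        if not isinstance(field_type, str):
            # Unions and nested types are kept as compact JSON text.
            field_type = json.dumps(field_type)
        rows.append((position, field["name"], field_type, field.get("doc", "")))
    return rows


if __name__ == "__main__":
    for position, name, field_type, doc in describe_fields(PAGE_VIEW_SCHEMA):
        print(f"{position}. {name}: {field_type} {doc}".rstrip())
```

With the schema available in a shared location, a developer can answer the question by reading it rather than by asking the team that wrote the data.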
This is why Confluent’s stream data platform includes a Schema Registry. We don’t want to force anyone to use it, and if you decide that schemas are not applicable to your use case, you can still use Kafka and the rest of the stream processing platform. But we believe that in many cases a schema registry is a best practice, and we want to encourage its use.
Because our Schema Registry is a first-class application in the Confluent Platform, it includes several important components that are crucial for building production-ready data pipelines:
Having the Schema Registry in the Confluent Platform means that from now on we can focus on making our Schema Registry better, rather than keep implementing the same thing again and again. And our customers can focus on the data and its uses, rather than reimplementing basic data infrastructure.