Confluent
How I Learned to Stop Worrying and Love the Schema

Geoff Anderson.

The rise in schema-free and document-oriented databases has led some to question the value and necessity of schemas. Schemas, in particular those following the relational model, can seem too restrictive, and the case has been made that software development can be faster and more agile without them. However, just because it’s possible to go without schemas doesn’t mean it’s wise to do so – this sort of local optimization can cause huge headaches within even a small organization.

The Hazards of Many Languages

Imagine for a moment that you work at a company where all employees are required to speak only in their native language. Intercommunication can work, but either everyone has to be multilingual, or expensive translators must be added for every pair of languages spoken in the company. Even if you have a sophisticated and efficient way of getting messages from place to place, you’re still stuck with the overhead of constant translation.

Furthermore, even if your company phases out, say, Latin, then unless you are willing to discard all Latin records, you’re stuck either employing your Latin translators for the rest of eternity or converting all Latin records into a new language.

Compare this to a company which standardized on a single language from the start. Every single form of communication is easier, and every message can be consumed many times at zero extra cost. Although there is an up-front cost in the sense that new employees must already know the language or be trained in it, the payoff is huge and permanent.

Having no standardized way of defining data across an organization presents a similar problem. It may be fine in the short term, but it quickly causes unnecessary difficulties, and it just doesn’t scale once the problem is multiplied across potentially thousands of different categories of data.

Temp-o-meter – A Tale of Woe

Let’s illustrate with a simple example. Say that your company has built a smartphone app called temp-o-meter which collects temperature data and sends it back to HQ. Version 0.1 was produced in a hurry, and produces simple comma-separated data points with the format:

“device_id, temp_celsius, timestamp, latitude, longitude”

A typical data point might look like this:

“123,100,1431811836.081531,37.386052,-122.083851”

Pretty reasonable, but time passes, and the team decides JSON is easier to work with, and temp-o-meter v0.2 produces data like this:

{
    "device_id": 123,
    "timestamp": 1431811836.081531,
    "temperature": 212,
    "latitude": 37.386052,
    "longitude": -122.083851
}

The problem is, some stage(s) in the downstream pipeline must now have logic to differentiate between CSV and JSON, and this logic must be aware that in the CSV format, temperature readings are in Celsius, but temperature stored in JSON is in Fahrenheit. What’s more, there may be some users who never upgrade their app, so the different versions of this data will continue to be published indefinitely.
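
To make that concrete, here is a rough sketch of the kind of branching a downstream stage ends up carrying. The function names `parse_reading` and `fahrenheit_to_celsius`, and the decision to normalize everything to Celsius, are assumptions made for illustration; only the field order and units come from the two formats shown above.

```python
import json


def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def parse_reading(raw):
    """Parse a v0.1 (CSV) or v0.2 (JSON) data point into one common dict.

    Hypothetical helper: normalizing to Celsius is an assumption of this sketch.
    """
    raw = raw.strip()
    if raw.startswith("{"):
        # v0.2: JSON, temperature in Fahrenheit
        data = json.loads(raw)
        return {
            "device_id": int(data["device_id"]),
            "timestamp": float(data["timestamp"]),
            "temp_celsius": fahrenheit_to_celsius(float(data["temperature"])),
            "latitude": float(data["latitude"]),
            "longitude": float(data["longitude"]),
        }
    # v0.1: CSV in the order device_id, temp_celsius, timestamp, latitude, longitude
    device_id, temp_c, timestamp, latitude, longitude = raw.split(",")
    return {
        "device_id": int(device_id),
        "timestamp": float(timestamp),
        "temp_celsius": float(temp_c),
        "latitude": float(latitude),
        "longitude": float(longitude),
    }


v01 = "123,100,1431811836.081531,37.386052,-122.083851"
v02 = json.dumps({
    "device_id": 123,
    "timestamp": 1431811836.081531,
    "temperature": 212,
    "latitude": 37.386052,
    "longitude": -122.083851,
})

# Both versions should describe the same reading once normalized.
print(parse_reading(v01) == parse_reading(v02))
```

Every consumer that touches this data needs its own copy of this format-sniffing and unit-conversion logic, and each copy has to be kept in sync by hand.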

Granted, this example is a bit contrived – clearly, for a given type of data, it’s not great to represent it with a mix of formats such as CSV, JSON, and XML. However, just standardizing on a single format is not enough. Standardizing on JSON without using schemas is a little like standardizing on the Roman alphabet without standardizing on a language – everyone can easily read and write individual letters, but that still doesn’t guarantee they can read the messages!

Let’s go a little further with the temp-o-meter example and pretend that we live in a science-fiction world where not only phones, but even things like watches can produce streams of data. temp-o-meter needs to be ported, but luckily for the watch team, JSON is now the standard, and the format of temperature data was loosely documented on an obscure wiki page.

Here’s what the temp-o-meter watch team came up with:

{
    "device_id": "watch_345",
    "timestamp": "Tue 05-17-2015 6:00",
    "temperature": 212,
    "latitude": 37.386052,
    "longitude": -122.083851
}

Not so bad on its own. However, although the format is now JSON, and although the field names are identical, the watch team used slightly different data formats in a few of the fields. “device_id” is no longer parseable as an integer, and “timestamp” is a completely different format.

Why is this a problem? Suppose there is an application which consumes data produced by temp-o-meter v0.2 – it might have a chunk of code like this:

# Consumer parses a chunk of data from temp-o-meter v0.2
import json

data = json.loads(data_chunk)  # data_chunk: one raw JSON message

device_id = int(data['device_id'])
timestamp = float(data['timestamp'])
temperature = float(data['temperature'])
latitude = float(data['latitude'])
longitude = float(data['longitude'])

This consumer must be upgraded before the watch team releases, otherwise it will be completely broken when it encounters the (unintentionally) new data format. Despite the fact that the watch and phone teams now both use JSON, different components of the system are now tightly coupled because the data’s ‘schema’ is embedded in both producers and consumers of this data. The ability to safely and independently evolve different components of the system has been hamstrung.
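
Here is a small sketch of that breakage, feeding the watch team’s message through the same conversions the v0.2 consumer performs. The wrapper name `parse_v02` is hypothetical; the payload is the watch message shown above.

```python
import json


def parse_v02(data_chunk):
    """Parse one message using the same conversions as the v0.2 consumer."""
    data = json.loads(data_chunk)
    return {
        "device_id": int(data["device_id"]),
        "timestamp": float(data["timestamp"]),
        "temperature": float(data["temperature"]),
        "latitude": float(data["latitude"]),
        "longitude": float(data["longitude"]),
    }


# The watch team's message: valid JSON, same field names, different value formats.
watch_chunk = json.dumps({
    "device_id": "watch_345",
    "timestamp": "Tue 05-17-2015 6:00",
    "temperature": 212,
    "latitude": 37.386052,
    "longitude": -122.083851,
})

try:
    parse_v02(watch_chunk)
    outcome = "parsed"
except ValueError as err:
    # int("watch_345") fails before the timestamp is even looked at
    outcome = "rejected: {}".format(err)

print(outcome)
```

The JSON itself parses without complaint; the failure only surfaces deep inside the consumer, at whichever field happens to be converted first.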

Had this company been using schemas, the watch and phone teams could have simply reused the same schema, avoiding the need for the watch team to reinvent the wheel, and preventing subtle incompatibilities which ultimately break a bunch of downstream consumers. By sharing the schema between watch and phone apps, this unintentional data evolution would have easily been avoided.

DRY Your Data Definitions With Schemas

At this stage in our little story, the ‘definition’ of temp-o-meter’s temperature data is decidedly un-DRY: it is encoded informally in temp-o-meter v0.1, temp-o-meter v0.2, some wiki pages, the watch app, and in all the various consumers which later parse and analyze this data.

On the other hand, by using schemas, the data definition for a particular kind of data exists in a single place. What’s more, schemas serve as self-contained and automatically enforceable contracts between writers and readers of data. Though they don’t remove the need for testing, schemas make testing data compatibility significantly simpler and can nip an entire class of problems in the bud by preventing corrupt, malformed or incompatible data from ever being published in the first place.
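
As a minimal sketch of what such a contract could look like, the snippet below keeps the data definition in one place and refuses to publish anything that violates it. Everything here is a hand-rolled stand-in invented for illustration: `TEMPERATURE_SCHEMA`, `validate`, and `publish` are hypothetical names, and a real deployment would more likely rely on an established schema system such as Avro or JSON Schema rather than this toy type check.

```python
import json

# Hypothetical shared schema: field name -> required Python type.
# This single definition is what both the phone and watch teams would import.
TEMPERATURE_SCHEMA = {
    "device_id": int,
    "timestamp": float,
    "temperature": float,
    "latitude": float,
    "longitude": float,
}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append("missing field: " + field)
            continue
        value = record[field]
        # Accept ints where floats are expected, but never bools.
        ok_types = (int, float) if expected is float else (expected,)
        if isinstance(value, bool) or not isinstance(value, ok_types):
            problems.append("{}: expected {}, got {}".format(
                field, expected.__name__, type(value).__name__))
    for field in record:
        if field not in schema:
            problems.append("unexpected field: " + field)
    return problems


def publish(record, schema=TEMPERATURE_SCHEMA):
    """Serialize the record only if it satisfies the schema."""
    problems = validate(record, schema)
    if problems:
        raise ValueError("; ".join(problems))
    return json.dumps(record)


phone_record = {"device_id": 123, "timestamp": 1431811836.081531,
                "temperature": 212, "latitude": 37.386052,
                "longitude": -122.083851}
watch_record = dict(phone_record, device_id="watch_345",
                    timestamp="Tue 05-17-2015 6:00")

print(validate(phone_record, TEMPERATURE_SCHEMA))
print(validate(watch_record, TEMPERATURE_SCHEMA))
```

With a check like this on the producer side, the watch team’s incompatible `device_id` and `timestamp` would be caught at publish time, long before any downstream consumer saw them.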

For additional compelling reasons to use schemas, it’s worth revisiting this post on Stream Data Platforms.
In the next post on schemas, I’ll talk more about how schemas can provide a powerful tool to help evolve data formats in a sane and compatible way.
