
Unknown Magic Byte! How to Address Magic Byte Errors in Apache Kafka



If you work with Kafka Streams, Apache Kafka® clients, and Schema Registry, you’ve likely come across this error: 

```
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
```

This error might be frustrating and mysterious at first, but hopefully, after you review this blog post, you’ll feel more confident handling it.

What is a magic byte?

Magic bytes are the first few bytes of data in a file, meant to help identify the file’s contents. They’re also known as “magic numbers.” So, what causes the “unknown magic byte” error in Kafka Streams?
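To make the idea concrete, here’s a minimal Node.js sketch (the function name is my own, not a library API) that sniffs a buffer’s format from its opening bytes:

```javascript
// These signatures are real: zip archives begin with "PK" (0x50 0x4B)
// and PDF files begin with "%PDF" (0x25 0x50 0x44 0x46).
function sniffFormat(buffer) {
  if (buffer[0] === 0x50 && buffer[1] === 0x4b) return "zip";
  if (buffer.slice(0, 4).toString("ascii") === "%PDF") return "pdf";
  return "unknown";
}

console.log(sniffFormat(Buffer.from("PK\x03\x04..."))); // → "zip"
console.log(sniffFormat(Buffer.from("%PDF-1.7 ...")));  // → "pdf"
```

Real tools like the Unix `file` command do essentially this, just with a much larger table of signatures.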

In general, there’s an essential problem within the publish/subscribe data architecture pattern: How do you make sure that the data formats of the publishers (in Kafka’s case, producers) match the data formats of the subscribers (in Kafka’s case, consumers)? 

Kafka solves this problem with Schema Registry. Schema Registry lives outside the Kafka brokers that host your topics, separate from both producers and consumers. Messages are validated against a registered schema before being sent. The Confluent documentation includes an illustration of how this works.

Schemas are available in three formats: Avro, Protobuf, and JSON Schema. Serializers that integrate with Schema Registry prepend an identifying byte sequence to each message; when a deserializer reads a message that doesn’t start with the expected magic byte, the dreaded “unknown magic byte” error is thrown.
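In Kafka’s case, the “signature” is the Confluent wire format: a single magic byte, 0x00, followed by a four-byte schema ID and then the serialized payload. Here’s a sketch of the check, mimicking (but in no way replacing) what a Schema Registry-aware deserializer does first:

```javascript
// Confluent wire format: [0x00][4-byte big-endian schema ID][payload].
const MAGIC_BYTE = 0x00;

function readWireFormat(message) {
  if (message.length < 5 || message[0] !== MAGIC_BYTE) {
    throw new Error("Unknown magic byte!");
  }
  return {
    schemaId: message.readInt32BE(1), // used to fetch the schema from the registry
    payload: message.slice(5),
  };
}

// A value produced as plain JSON, with no registry-aware serializer,
// fails the check -- which is exactly what the Kafka error reports:
try {
  readWireFormat(Buffer.from('{"plain":"json"}'));
} catch (e) {
  console.log(e.message); // → "Unknown magic byte!"
}
```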

How to address “unknown magic byte”

Generally, the “unknown magic byte” error means that you must reconcile serialization methods and check that your schemas use the same format on the production and consumption ends. You might have to do this in the client, in ksqlDB, or in Kafka Streams, depending on your project. 

I ran into the error myself recently in a Shakespeare app that I created. I was pulling data from the Folger Shakespeare API and trying to join, in ksqlDB, two topics built from the API results. I had never set up Schema Registry in a client before, so I didn’t realize that it’s necessary when you use ksqlDB. Nothing showed up in my streams after I created them!

Luckily, ksqlDB has something called a “processing log,” which is a stream of metadata about your instance. Querying it resulted in an error with an “unknown magic byte” string. 
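The processing log is exposed as a ksqlDB stream, named `KSQL_PROCESSING_LOG` by default, so you can query it like any other stream; deserialization failures land there along with their error text:

```sql
-- Each record describes one processing error, including deserialization
-- failures such as "unknown magic byte."
SELECT message FROM KSQL_PROCESSING_LOG EMIT CHANGES;
```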

It turns out that when you create a stream with ksqlDB in Confluent Cloud, you need to declare the value format so that values can be deserialized.
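For example, a stream over Avro-encoded values might be declared like this (the stream, topic, and column names here are placeholders, not my app’s actual ones):

```sql
CREATE STREAM shakespeare_lines (
  title VARCHAR,
  line_text VARCHAR
) WITH (
  KAFKA_TOPIC = 'shakespeare-lines',
  -- VALUE_FORMAT tells ksqlDB how to deserialize each message's value.
  VALUE_FORMAT = 'AVRO'
);
```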


Since I created the stream using the “AVRO” format, I needed to set up an Avro schema in my client using confluent-schema-registry. You can view the full code on GitHub, but here’s the gist:

```javascript
const { SchemaRegistry, SchemaType } = require("@kafkajs/confluent-schema-registry");

const registry = new SchemaRegistry({
  host: "", // your Schema Registry URL goes here
  auth: {
    username: `${process.env.SCHEMA_USERNAME}`,
    password: `${process.env.SCHEMA_PASSWORD}`,
  },
});

const schema = `
...schema definition here
`;

const { id } = await registry.register({
  type: SchemaType.AVRO,
  schema,
});
```

From there, `registry.encode(id, payload)` returns the value with the magic byte and schema ID prepended, ready to produce to the topic.

Magic byte tutorial

If you’re still curious about magic bytes, the following tutorial helps to solidify the concept of a magic byte. 


1. `git clone && cd magic-byte-illustration`

2. Open the zip file in your text editor. You’ll see that it starts with the characters “PK”.

3. PK is the file signature, or magic byte, for the zip file format. You can verify it by running `file`.

4. Let's change the file signature bytes so that `file` reads this file as a PDF.

Erase “PK” on line 1 and replace it with “%PDF”.

5. Run `file` to confirm. Note that while the extension is still ".zip", the file signature is for a PDF, so it's identified as a PDF. Pretty cool huh?
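The same swap can be scripted; this Node.js sketch forges the signature in memory instead of in an editor:

```javascript
// Every zip archive opens with the two magic bytes "PK" (0x50 0x4B).
const zipBytes = Buffer.from("PK\x03\x04 ...rest of archive...");
console.log(zipBytes.slice(0, 2).toString("ascii")); // → "PK"

// Replace the signature with "%PDF", exactly as the editor step above does.
const forged = Buffer.concat([Buffer.from("%PDF"), zipBytes.slice(2)]);

// Signature-sniffing tools like `file` now identify the bytes as a PDF.
console.log(forged.slice(0, 4).toString("ascii")); // → "%PDF"
```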


This post uses Node.js for the solution, but if you’re working in another language, take a look at the documentation for the Java client, or ask in the Confluent Community Slack.

