When we released Apache Kafka 0.9.0.0, we talked about all of the big new features we added: the new consumer, Kafka Connect, security features, and much more. What we didn’t talk about was something even more important, something that we had spent even more of our time on — correctness, bug fixes, and operability. These are always more important than new features.
According to Apache JIRA, 290 bugs have been fixed for 0.9.0.0 release and some of them are quite important. Even more exciting is the fact that while working on 0.9.0.0, we added a brand new distributed testing framework and over 100 new test scenarios that use this framework. We are now testing replication, node failures, controller failures, MirrorMaker, rolling upgrades, security scenarios, Kafka connect failures, and much more. This allowed us to not only catch many issues for this release but will give us the confidence that we are maintaining the high quality that Kafka is known for in the future.
Here are some of the more noteworthy bugs we caught and fixes for Apache Kafka 0.9.0.0:
- Replication is the backbone of Kafka. When replication goes wrong, bad things happen. In Kafka 0.9.0.0 we fixed varied replication issues. For example, we found and fixed an obscure race condition where if a machine ever gets slow enough that context switching between threads is slower than a remote call, it is possible for a broker to think it has fallen out of sync and as a result delete all its data (KAFKA-2477), min.insync.replica default configuration not working as expected (KAFKA-2114) and replication lag being impossible to configure (KAFKA-1546).
- MirrorMaker is Kafka’s cross-cluster replication tool. In the 0.8 release line, MirrorMaker buffered messages between the consumers reading from source cluster and the producers writing to the destination. Consumed offsets were stored using a separate thread (marking messages as “done”). When MirrorMaker process crashed, in some cases messages in the buffer were considered “done” even though they were never written to the target cluster, thereby losing these messages. Kafka 0.9.0.0 includes a newly refactored MirrorMaker with a simpler design that prevents message loss by making sure message offsets are stored only when we are certain the messages were written safely to the target cluster. (KAFKA-1997).
- Kafka application logs can be too chatty at INFO level but too quiet at WARN level. This makes it difficult to troubleshoot issues and sometimes causes false alarms. In Kafka 0.9.0.0 we cleaned up the logs, making them more managable (See: KAFKA-2504, KAFKA-2288, KAFKA-2251, KAFKA-2522, KAFKA-1461)
- Log Compaction is one of the most exciting Kafka features, enabling a variety of new use-cases. Unfortunately, it also had some nasty bugs, so many users opted out even for use cases where compaction was a natural fit. For 0.9.0.0 we fixed a large number of log compaction bugs and limitations. The biggest improvement is the ability to compact topics with compressed messages (KAFKA-1374), but there was a very large number of additional improvements (KAFKA-2235. KAFKA-2163, KAFKA-2118, KAFKA-2660, KAFKA-1755).
- Connection leak can be an issue in a shared environments when applications connecting to Kafka can’t be relied on to properly close their connections. Kafka 0.9.0.0 includes two patches that make the server much more efficient at detecting and cleaning dead connections (KAFKA-1282, KAFKA-2096).
- Kafka broker metadata includes a list of leaders and replicas for each partition. This metadata is stored in ZooKeeper and is also cached in memory of each broker. 0.9.0.0 release includes multiple bug fixes for cases where the metadata cache falls out of sync (KAFKA-1867, KAFKA-1367, KAFKA-2722, KAFKA-972).
- The Request Purgatory, where client requests wait until they can be responded to, underwent a complete re-write into a far more efficient data structure in 0.9.0.0. In the process we also fixed a bug where the purgatory was growing out of control (KAFKA-2147).
- Producer timeouts for the new producer were not strictly enforced in 0.8.2, so some operations would block for much longer than specified timeout. In 0.9.0.0 the tracking of timeouts was improved and timeouts are now consistent and work as expected (KAFKA-2120).
I expect some of these issues may ring an alarm bell, maybe even a loud and annoying bell, in which case the reason to upgrade to Kafka 0.9.0.0 should be clear. Even if you have not come across any of these issues yet, you don’t know when you will. It is much better to have time to plan for an upgrade, rather than have to upgrade under pressure because your production system just hit a bug that was fixed 8 months ago.