290 Reasons to Upgrade to Apache Kafka

When we announced this release of Apache Kafka, we talked about all of the big new features we added: the new consumer, Kafka Connect, security features, and much more. What we didn’t talk about was something even more important, something that we had spent even more of our time on: correctness, bug fixes, and operability. These are always more important than new features.

According to Apache JIRA, 290 bugs have been fixed for this release, and some of them are quite important. Even more exciting is the fact that while working on it, we added a brand new distributed testing framework and over 100 new test scenarios that use this framework. We are now testing replication, node failures, controller failures, MirrorMaker, rolling upgrades, security scenarios, Kafka Connect failures, and much more. This not only allowed us to catch many issues in this release, but also gives us confidence that future releases will maintain the high quality that Kafka is known for.

Here are some of the more noteworthy bugs we caught and fixed in this release:

    1. Replication is the backbone of Kafka. When replication goes wrong, bad things happen. In this release we fixed a variety of replication issues. For example, we found and fixed an obscure race condition: if a machine ever gets slow enough that context switching between threads is slower than a remote call, a broker can conclude that it has fallen out of sync and, as a result, delete all its data (KAFKA-2477). We also fixed the default min.insync.replicas configuration not working as expected (KAFKA-2114) and replication lag being impossible to configure (KAFKA-1546).
    2. MirrorMaker is Kafka’s cross-cluster replication tool. In the 0.8 release line, MirrorMaker buffered messages between the consumers reading from the source cluster and the producers writing to the destination. Consumed offsets were stored by a separate thread (marking messages as “done”). When the MirrorMaker process crashed, in some cases messages in the buffer were considered “done” even though they were never written to the target cluster, and those messages were lost. This release includes a newly refactored MirrorMaker with a simpler design that prevents message loss by storing message offsets only when we are certain the messages were written safely to the target cluster (KAFKA-1997).
    3. Kafka application logs can be too chatty at INFO level but too quiet at WARN level. This makes it difficult to troubleshoot issues and sometimes causes false alarms. In this release we cleaned up the logs, making them more manageable (see KAFKA-2504, KAFKA-2288, KAFKA-2251, KAFKA-2522, KAFKA-1461).
    4. Log Compaction is one of the most exciting Kafka features, enabling a variety of new use cases. Unfortunately, it also had some nasty bugs, so many users opted out even for use cases where compaction was a natural fit. For this release we fixed a large number of log compaction bugs and limitations. The biggest improvement is the ability to compact topics with compressed messages (KAFKA-1374), but there were many additional improvements (KAFKA-2235, KAFKA-2163, KAFKA-2118, KAFKA-2660, KAFKA-1755).
    5. Connection leaks can be an issue in shared environments, where applications connecting to Kafka can’t be relied on to properly close their connections. This release includes two patches that make the server much more efficient at detecting and cleaning up dead connections (KAFKA-1282, KAFKA-2096).
    6. Kafka broker metadata includes a list of leaders and replicas for each partition. This metadata is stored in ZooKeeper and is also cached in memory on each broker. This release includes multiple bug fixes for cases where the metadata cache falls out of sync (KAFKA-1867, KAFKA-1367, KAFKA-2722, KAFKA-972).
    7. The Request Purgatory, where client requests wait until they can be responded to, underwent a complete rewrite into a far more efficient data structure in this release. In the process we also fixed a bug where the purgatory was growing out of control (KAFKA-2147).
    8. Producer timeouts for the new producer were not strictly enforced in 0.8.2, so some operations would block for much longer than the specified timeout. In this release the tracking of timeouts was improved, and timeouts are now consistent and work as expected (KAFKA-2120).
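
The MirrorMaker fix in item 2 comes down to when offsets are committed. The toy model below is only a sketch of that idea, not MirrorMaker's actual code: the `mirror` and `run_with_one_crash` functions, the list-based "clusters", the batch size, and the crash point are all hypothetical stand-ins chosen for illustration. It compares committing an offset as soon as a batch is read with committing only after the batch has been written to the target.

```python
def mirror(source, target, committed, batch_size, commit_on_read,
           crash_at_batch=None):
    """Copy source[committed:] into target and return the committed offset.

    Hypothetical model: `source` and `target` are plain lists standing in
    for a source partition and the destination cluster.

    commit_on_read=True  -> offsets are marked "done" when a batch is read,
                            independently of the write (the old behaviour
                            described in item 2).
    commit_on_read=False -> offsets are committed only after the batch has
                            been written to the target.
    crash_at_batch       -> simulate the process dying after reading that
                            batch (0-based) but before writing it.
    """
    pos = committed
    batch_no = 0
    while pos < len(source):
        batch = source[pos:pos + batch_size]
        if commit_on_read:
            committed = pos + len(batch)   # marked "done" before the write
        if crash_at_batch == batch_no:
            return committed               # crash: batch is still only in memory
        target.extend(batch)               # write to the target cluster
        committed = pos + len(batch)       # safe: the write has completed
        pos += len(batch)
        batch_no += 1
    return committed


def run_with_one_crash(commit_on_read):
    """Mirror ten messages, crashing once during the second batch."""
    source = [f"msg-{i}" for i in range(10)]
    target = []
    # First run: dies after reading the second batch (offsets 4..7).
    committed = mirror(source, target, 0, 4, commit_on_read, crash_at_batch=1)
    # Restart: resume from whatever offset was committed before the crash.
    mirror(source, target, committed, 4, commit_on_read)
    return source, target


if __name__ == "__main__":
    for mode in (True, False):
        source, target = run_with_one_crash(commit_on_read=mode)
        lost = [m for m in source if m not in target]
        print(f"commit_on_read={mode}: lost {lost}")
```

In this model, committing on read loses the batch that was in memory at the time of the crash, while committing after the write re-reads that batch on restart. The safer ordering has a cost: if a real process crashed partway through writing a batch, the restart would resend some messages, so duplicates are possible even though loss is not.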

I expect some of these issues may ring an alarm bell, maybe even a loud and annoying bell, in which case the reason to upgrade should be clear. Even if you have not come across any of these issues yet, you don’t know when you will. It is much better to have time to plan for an upgrade than to have to upgrade under pressure because your production system just hit a bug that was fixed eight months ago.

To make things easier, Apache Kafka can be upgraded with no downtime by using rolling upgrades. Check our documentation to learn the exact process and start planning your upgrade.
