Apache Kafka® is used in thousands of companies, including some of the most demanding, large-scale, and critical systems in the world. Its largest users run Kafka across thousands of machines, processing trillions of messages per day. It serves as the backbone for critical market data systems in banks and financial exchanges. It’s part of the billing pipeline in numerous tech companies. It’s also used as a commit log for several distributed databases (including the primary database that runs LinkedIn). In all of these environments the most fundamental concern is maintaining correctness and performance: how can we ensure the system stays up and doesn’t lose data? How do you write software for this type of demanding usage?
The reality is it’s very hard, and there is no silver bullet. Distributed systems are notorious for their subtle corner cases and the difficulty of tracking down and reproducing problems. Eliminating them requires a set of practices across the software development lifecycle, from design all the way through to production. Most projects do a good job of documenting their APIs and feature set but do little to document the practices and strategies they use to ensure the correctness of those features. The goal of this blog post is to give some insight into how Confluent and the larger Apache Kafka community handle testing and other practices aimed at ensuring quality.
Starting with Design: Kafka Improvement Proposals
The trend in software is away from up-front design processes and towards a more agile approach. This is probably a healthy thing for application development, but for distributed systems we think that starting with a good design is essential to making good software. What are the core contracts, guarantees, and APIs? How will the new feature or module interact with other subsystems? If you cannot clearly explain these things in a small amount of writing, it will be very hard to test whether you’ve implemented them correctly in a large code base. Perhaps even more importantly, the goal of design discussions is to ensure the full development community understands the intention of a feature, so that code reviews and future development maintain its correctness as the code base evolves.
To make sure that we’ve considered these questions, Kafka requires any major new feature or subsystem to come with a design document, called a Kafka Improvement Proposal (KIP). This allows changes to go through a broad and open debate. For example, the discussion about the KIP-98 proposal for exactly-once delivery semantics, a very large feature, took several months but ended up significantly improving the original design. Design discussions often feel slow, but the reality is that a design can evolve much faster than the time required to implement a feature, roll it out at thousands of companies, realize the limitations of the approach, and then redesign and reimplement it.
The Kafka community has a culture of deep and extensive code review that tries to proactively find correctness and performance issues. Code review is, of course, a pretty common practice in software engineering, but it is often a cursory check of style and high-level design. We’ve found a deeper investment of time in code review really pays off.
The failures in distributed systems usually have to do with error conditions, often in combinations and states that are difficult to trigger in a targeted test. There is simply no substitute for a deeply paranoid individual going through new code line by line and spending significant time trying to think of everything that could go wrong. This often uncovers the kind of rare problem that no test would have caught.
Testing & Tradeoffs
Software engineers often advocate the superiority of unit tests over integration tests. We’ve found that what is needed is a hierarchy of testing approaches. Unit tests are fast to run and easy to debug, but you need to combine them with much more complete tests that run more fully integrated versions of the software in more realistic environments.
Kafka has over 6,800 unit tests which validate individual components or small sets of components in isolation. These tests are fast to run and easy to debug because their scope is small. But because they don’t test the interactions between modules, and run only in a very artificial environment, they don’t provide much assurance as to the correctness of the system as a whole.
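To make the scope of a unit test concrete, here is a minimal sketch in Python. The `assign_partition` helper is hypothetical and is not Kafka code; it uses `zlib.crc32` purely as a stand-in hash. The point is only that each test exercises one small component, with no broker, network, or disk involved.

```python
import zlib


def assign_partition(key: bytes, num_partitions: int) -> int:
    """Hypothetical helper: map a record key to a partition index."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is a stand-in hash chosen for illustration only.
    return zlib.crc32(key) % num_partitions


def test_same_key_always_maps_to_same_partition():
    assert assign_partition(b"user-42", 8) == assign_partition(b"user-42", 8)


def test_partition_is_within_range():
    for i in range(1000):
        assert 0 <= assign_partition(f"key-{i}".encode(), 8) < 8


def test_rejects_non_positive_partition_count():
    try:
        assign_partition(b"k", 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Tests like these run in milliseconds and fail with an obvious cause, which is exactly why they say little about how the component behaves once it is wired into a running cluster.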
In addition to unit tests, we use other pre-commit checks such as findbugs (a static analysis tool) and checkstyle (a style check to keep the code pretty).
Kafka has over 600 integration tests which validate the interaction of multiple components running in a single process. A typical integration test might set up a Kafka broker and a client, and verify that the client can send messages to the broker.
These tests are slower than unit tests, since they involve more setup work, but provide a good check of correctness in the absence of load or environmental failures such as hard crashes or network issues.
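The shape of such a test can be sketched as follows. This is an illustration only: `InMemoryBroker` and `Client` are toy classes invented for the example, standing in for a real broker and client started inside the test process. What matters is that two components are wired together and the test checks their interaction end to end.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def append(self, topic, value):
        log = self._logs[topic]
        log.append(value)
        return len(log) - 1  # offset assigned to the appended message

    def read(self, topic, offset):
        return self._logs[topic][offset:]


class Client:
    """Toy client that talks to the broker in the same process."""

    def __init__(self, broker):
        self._broker = broker

    def send(self, topic, value):
        return self._broker.append(topic, value)

    def fetch(self, topic, offset=0):
        return self._broker.read(topic, offset)


def test_client_can_send_and_fetch():
    broker = InMemoryBroker()
    client = Client(broker)
    offsets = [client.send("events", f"msg-{i}".encode()) for i in range(3)]
    assert offsets == [0, 1, 2]
    assert client.fetch("events") == [b"msg-0", b"msg-1", b"msg-2"]
    assert client.fetch("events", offset=2) == [b"msg-2"]
```

Nothing in this setup can crash, stall, or drop packets, which is the gap the next level of testing is meant to close.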
Distributed System Tests
Many software projects are limited to just unit and single-process integration tests; however, we’ve found this is insufficient, as they don’t cover the full spectrum of problems that plague distributed data systems: concurrency issues that occur only under load, machine failures, network failures, slow processes, compatibility between versions, subtle performance regressions, and so on. To detect these problems, you must test a realistic deployment of the distributed system in a realistic environment. We call these multi-machine tests system tests to differentiate them from single-process, single-machine integration tests.
Constructing distributed tests isn’t that hard for a system like Kafka: it has well-specified formal guarantees and performance characteristics which can be validated, and it isn’t that hard to write a test to check them. What is more difficult is making this kind of test automated, maintainable, and repeatable. This requires making it easy to deploy different distributed topologies of Kafka brokers, ZooKeeper nodes, stream processors, and other components, and then orchestrate tests and failures against the setup. Harder still is making this type of test debuggable: if you run ten million messages through a distributed environment under load while introducing failures and you detect that one message is missing, where do you even start looking for the problem?
To aid us in doing this we created a framework called ducktape. Ducktape does the hard work of creating the distributed environment, setting up clusters, and introducing failures. It helps in scripting up test scenarios, and collecting test results. Perhaps most importantly it also helps to aggregate logs and metrics for all the tests it runs so that failures can be debugged.
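One small piece of making such a run debuggable is knowing exactly which message went missing. The sketch below is not ducktape code; it is a hypothetical helper that assumes every message carries a `(producer_id, sequence_number)` tag, so the set of acknowledged messages can be compared with what consumers actually read.

```python
from collections import Counter


def verify_delivery(acked, consumed):
    """Compare acknowledged messages with consumed ones.

    Both arguments are iterables of (producer_id, sequence_number) tuples.
    Returns the acked messages that were never consumed and the messages
    that were consumed more than once.
    """
    counts = Counter(consumed)
    missing = sorted(set(acked) - set(counts))
    duplicated = sorted(m for m, n in counts.items() if n > 1)
    return {"missing": missing, "duplicated": duplicated}


acked = [("p0", 0), ("p0", 1), ("p0", 2), ("p1", 0)]
consumed = [("p0", 0), ("p0", 2), ("p0", 2), ("p1", 0)]
report = verify_delivery(acked, consumed)
print(report)
```

With the tag of the missing message in hand, the aggregated logs can be searched for that producer and sequence number instead of being read from the top.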
The system test framework allows us to build a few types of tests that would otherwise be impossible:
- Stress Tests: These tests run under heavy load and check correctness under load.
- Performance Tests: One-time benchmarks are great, but performance regressions need to be checked for daily.
- Fault Injection Tests: These tests induce failures such as disk errors, network failures, and process pauses (to simulate GC or I/O stalls).
- Compatibility Tests: These tests check the compatibility of older versions of Kafka with new versions, or test against external systems such as Kafka Connect connectors.
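As a rough illustration of the fault injection idea from the list above, the following sketch runs entirely in one process and makes several assumptions: `FlakyBroker`, `send_with_retries`, and the failure model are all invented for the example. The broker fails a configurable fraction of requests, sometimes before the write and sometimes after it (a lost acknowledgement), and the scenario then checks that no acknowledged message is absent from the log.

```python
import random


class FlakyBroker:
    """Toy broker that injects failures using a seeded random generator."""

    def __init__(self, failure_rate, seed):
        self.log = []
        self._rng = random.Random(seed)  # seeded so a run is repeatable
        self._failure_rate = failure_rate

    def append(self, value):
        if self._rng.random() < self._failure_rate:
            # Roughly half of the injected faults happen after the write,
            # which models a lost acknowledgement; the rest drop the request.
            if self._rng.random() < 0.5:
                self.log.append(value)
            raise ConnectionError("injected fault")
        self.log.append(value)
        return len(self.log) - 1


def send_with_retries(broker, value, max_attempts=50):
    """Retry until the broker acknowledges; return True if acknowledged."""
    for _ in range(max_attempts):
        try:
            broker.append(value)
            return True
        except ConnectionError:
            continue
    return False


def run_fault_injection_scenario(num_messages=1000, failure_rate=0.3, seed=7):
    broker = FlakyBroker(failure_rate, seed)
    acked = [i for i in range(num_messages) if send_with_retries(broker, i)]
    lost = set(acked) - set(broker.log)
    return acked, broker.log, lost


acked, log, lost = run_fault_injection_scenario()
assert not lost, f"acknowledged messages missing from the log: {sorted(lost)}"
```

Retrying after a lost acknowledgement can write the same message twice, so this toy setup only demonstrates at-least-once delivery. A real system test injects faults into actual processes and machines rather than into a Python object, but the structure (inject, drive load, verify the guarantee) is the same.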
We run over 310 system test scenarios nightly, comprising over 350 machine hours per day. You can see the nightly results and test scenarios run here.
Ducktape is open source and not Kafka specific, so if you are facing similar problems, it might be worth checking out.
There’s nothing quite like production for finding problems. Fortunately Kafka has a big community of power users that help test Kafka in production-like environments prior to release, often by mirroring production load and running it against new versions of the software to ensure it works in their environment with their usage pattern.
Since all our testing is automated, the code base stays as close as possible to a continually releasable state. This allows LinkedIn and a few other organizations to run off trunk versions that are ahead of the official releases. The philosophy of these organizations is that more frequent upgrades mean lower risk and a smaller set of changes in which to look for any problems.
This tight feedback loop between people running Kafka at scale and the engineers writing code has long been an essential part of development. At Confluent, Confluent Cloud, our hosted Kafka offering, gives us the ability to observe a wide variety of production workloads in a very heavily instrumented environment. This feedback loop is what ensures that the tools, configs, metrics, and practices for at-scale operation really work. After all, running software in production is the ultimate test.
Thomas Edison once said that genius is 1% inspiration and 99% perspiration. This is true of testing too! At Confluent, we are working to put in some of that perspiration, so that Kafka and the other components of Confluent Platform will continue to be a solid foundation to build on.