Restoring local state in Kafka Streams applications is indispensable for recovering after a failure and for moving stream processors between Kafka Streams clients. However, restoration has a reputation for being operationally problematic: while a Streams client is restoring the state of some stream processors, its other stream processors cannot process new records, even when they are ready to do so. When the state is large, this can considerably reduce the overall throughput of the Streams application. Additionally, when a failure interrupts restoration, restoration restarts from the beginning, which reduces throughput even further.
In this talk, we will explain how Kafka Streams currently restores local state and processes records. We will show how we decouple processing from restoration by moving restoration to a dedicated thread, and how throughput benefits from this decoupling. We will also present how we avoid restarting restoration from the beginning after a failure. Finally, we will discuss the concurrency and performance problems that we had to overcome, and we will present benchmarks that show the effects of our improvements.
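To make the two ideas concrete, here is a minimal toy sketch in plain Java. It is not the Kafka Streams implementation and uses no Kafka APIs: `RestorationSketch`, `ToyTask`, `restore`, and the in-memory changelog are hypothetical names invented for illustration. It assumes that a restored task can be handed to the processing thread through a thread-safe queue, and that a per-task checkpoint records the next changelog offset to replay, so that a retry after a failure resumes there instead of at offset 0.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RestorationSketch {

    /** Hypothetical stand-in for a stateful task; not a Kafka Streams class. */
    public static final class ToyTask {
        public final Map<String, String> state = new HashMap<>();
        public long checkpoint = 0;       // next changelog offset to replay
        public long recordsRestored = 0;  // changelog records applied so far
        public long recordsProcessed = 0; // input records processed so far
    }

    /**
     * Replays the changelog into the task's state, starting at the task's
     * checkpoint. A crash is simulated just before offset failAt (-1 = never).
     */
    static void restore(ToyTask task, List<String[]> changelog, long failAt) {
        for (long offset = task.checkpoint; offset < changelog.size(); offset++) {
            if (offset == failAt) {
                throw new IllegalStateException("simulated failure at offset " + offset);
            }
            String[] record = changelog.get((int) offset);
            task.state.put(record[0], record[1]);
            task.checkpoint = offset + 1; // remember progress for a later retry
            task.recordsRestored++;
        }
    }

    /** Returns { the task that kept processing, the task that was restored }. */
    public static ToyTask[] run() {
        List<String[]> changelog = List.of(
                new String[] {"a", "1"}, new String[] {"b", "2"},
                new String[] {"a", "3"}, new String[] {"c", "4"});
        ToyTask readyTask = new ToyTask();
        ToyTask restoringTask = new ToyTask();
        BlockingQueue<ToyTask> restoredTasks = new LinkedBlockingQueue<>();

        // Dedicated restoration thread: the processing loop never waits on it.
        Thread restorer = new Thread(() -> {
            try {
                restore(restoringTask, changelog, 2);  // first attempt fails at offset 2
            } catch (IllegalStateException e) {
                restore(restoringTask, changelog, -1); // retry resumes at the checkpoint
            }
            restoredTasks.add(restoringTask);          // hand over to the processing thread
        }, "toy-restore-thread");
        restorer.start();

        // Processing loop: keep working on the ready task and check for
        // newly restored tasks without blocking.
        ToyTask handedOver = null;
        while (handedOver == null) {
            readyTask.recordsProcessed++;      // stands in for processing one record
            handedOver = restoredTasks.poll(); // returns null if nothing is restored yet
        }
        try {
            restorer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ToyTask[] {readyTask, handedOver};
    }

    public static void main(String[] args) {
        ToyTask[] tasks = run();
        System.out.println("processed while restoring: " + tasks[0].recordsProcessed);
        System.out.println("changelog records applied: " + tasks[1].recordsRestored);
    }
}
```

In this sketch the restored task is only touched by the processing thread after it has been taken from the queue, which is what makes the handover safe without extra locking. The simulated failure replays each changelog record exactly once across both attempts, because the retry starts at the checkpoint rather than at the beginning.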