Apache Kafka® applications run in a distributed manner across multiple containers or machines. And in the world of distributed systems, what can go wrong often goes wrong. This blog post covers different ways to handle errors and retries in your event streaming applications. The nature of your process determines the patterns, and more importantly, your business requirements.
This blog provides a quick guide on some of those patterns and expands on a common and specific use case where events need to be retried following their original order. This blog post illustrates a scenario of an application that consumes events from one topic, transforms those events, and produces an output to a target topic, covering different approaches as they gradually increase in complexity.
There are cases when all input events must be processed in order without exceptions. An example is handling the change-data-capture stream from a database.
The following diagram illustrates how events in the source topic are processed or transformed and published to the target topic. An error in the process causes the application to stop and manual intervention is required. Notice that the events in the source topic cannot take any other path.
This is a common scenario where events that cannot be processed by the application are routed to an error topic while the main stream continues. It’s important to note that in this approach there are no conditions that require or support a retry process. In other words, an event can be processed successfully, or it is routed to an error topic.
The following diagram illustrates how events in the source topic can take one of two paths:
What happens if the conditions required to process an event are not available when the application attempts to process the event? For example, consider an application that handles requests to purchase items. The price of an item may be produced by a different application and could be missing at the time that the request is received.
Adding a retry topic provides the ability to process most events right away while delaying the processing of other events until the required conditions are met. In the example, you would route events to the retry topic if the price of the item is not available at the time. This recoverable condition should not be treated as an error but periodically retried until the conditions are met.
In your containerized environment you may dedicate one or two instances to handle the retry process, as few events are expected to follow this path.
The following diagram illustrates how an event in the source topic can take one of three different paths:
There is a very important aspect to highlight with this pattern: Events are not guaranteed to be processed in the same sequence received in the source topic. This is because the retry path is “longer” and usually slower than the normal execution path, as there are fewer retry instances and retries are delayed.
The following diagram illustrates how the first event received (Event 1) reaches the target topic after an event that was received later (Event 2). You should use this pattern only if you are okay with this behavior.
As shown in the previous pattern, adding a retry topic and associated flow provides a few benefits by delaying the execution of some events until the required conditions are met. However, the previous pattern also illustrated that the order at the target topic may change.
There are some conditions where changing the order of the events is not acceptable. For instance, an application that updates the inventory of an item would produce different results if an increase in the inventory is processed before a decrease or vice versa. If there is a trigger that initiates a “purchase request” when an item’s inventory falls below 10 and you currently have 10 units, processing a decrease of inventory before a previously received increase may trigger an unwanted “purchase request”, as the inventory may fall below a desired limit. This could also cause an error if the inventory goes below zero. This final pattern addresses that problem.
In this pattern, the main application needs to keep track of every event routed to the retry topic. When a dependent condition is not met (for instance, the price of an item) the main application stores a unique identifier for the event in a local in-memory structure. The unique event identifiers are grouped by the item that they belong to. This helps the application determine if events related to a particular item, for example, Item A, are currently being retried and therefore subsequent events related to Item A should be sent to the retry path to preserve order.
When the first event that has missing dependencies is received, the main application performs the following tasks:
When the next event is received, the application checks the local store to determine if there are events for that item. If one or more events for the item are found, the application knows that some events are being retried and will route the new event to the retry flow.
In the example, if the application receives an event for Item A, which currently has events that are being retried, the application will not attempt to process the event but will instead route it through the retry flow. This ensures that all the events for Item A are processed in the same order that they were received. The application adds the unique identifier for that new event to the local store and routes to the retry and redirect topics as before.
The retry application handles the events in the retry topic in the order that they are received. When an event is successfully retried and published to the target topic, the retry application instance sends confirmation in the form of a tombstone event to the redirect topic. One tombstone event is published for each successfully retried event.
The main application listens to the redirect topic for tombstone events that signal successful retry. The application removes the messages from the in-memory store. This also allows for subsequent events for the same item to be processed through the main flow.
The final diagram illustrates the path taken by an event for an item that has all the required dependencies and can follow the normal flow. The main application performs the following tasks:
In the case of failure, the in-memory store that the main application was managing will be gone. However, this can easily be restored by reading the events in the redirect topic and initializing that in-memory store.
Error handling and retries are important aspects of the development of all types of applications, and Kafka applications are no exception. The approaches mentioned are not intended to cover all possible aspects but provide guidance that can be adapted to meet your needs.
If you’d like to learn more, check out Confluent Developer to find the largest collection of resources for getting started, including end-to-end Kafka tutorials, videos, demos, meetups, podcasts, and more.
Gerardo Gutierrez Villeda was born in Mexico, earned his bachelor’s degree in electronic engineering from the Universidad Autónoma Metropolitana, and started his career writing code. He has worked at SEI for 20 years and is currently a senior enterprise architect, where he helps define governance, standards, best practices, and the patterns that the SEI development community adopts. He is focused on building a distributed event streaming platform that integrates various heterogeneous systems using Apache Kafka, Kafka Connect and Confluent Schema Registry. Gerardo is a Confluent Certified Developer for Apache Kafka, AWS Certified Solutions Architect – Associate, and an AWS Certified Developer – Associate.