By Ernesto Matos
Our business at Loggi has grown a lot over the past few years, and with that expansion came the realization that our systems had to be more distributed. We pushed our architecture to a new level so we could keep up with the company's growth by building new event-driven systems and real-time data analytics tools.
This new distributed architecture allowed us to scale Loggi’s offerings to other regions, from less than 1,000 to more than 3,000 cities in Brazil. We’ve moved from a static infrastructure into a platform that serves as an internal product and an internal developer platform. And we’ve helped teams move faster as we’ve added services and broken down our monolithic system.
Read on to learn how we tackled building an event-driven architecture with Apache Kafka, then Confluent Cloud, to improve data analytics and more, helping Loggi’s logistics work faster, easier, and in real time.
The core of Loggi’s platform had been run using monolithic code. Loggi provides logistics services across Brazil, and as the business expanded, this monolith slowed down the product development process and increased the availability risk. This led us to an event-driven architecture to distribute domains and move data across them easily. We also wanted to make members of our engineering team more productive—that’s our primary goal as the infrastructure and platform team. With any technology decision, we’re giving people freedom to solve the business issues they’re working on, not the architecture issues.
Kafka was the most obvious choice when evaluating the event streaming platform to be the backbone for our event-driven architecture since it is used by thousands of companies, including over 80% of the Fortune 100 in their data streaming platforms and event-driven designs. We wanted people to easily be able to use our event-driven architecture, and this project was part of that bigger goal. As we choose technology to address business goals, we also design teams to respond to business needs. Then we’re responding to a specific business problem, not a technology movement.
Besides simplicity, these were the requirements when we were designing this new architecture:
It had to work across all of our technology stacks;
It should be able to abstract most of the complexities of Kafka;
The events produced should be made available for consumption in our data lake for historical analysis;
It needed to have strong transactional guarantees.
Operating a Kafka deployment is a cumbersome job, mainly because Kafka is a large complex system on its own. On top of that, it brings in additional complexities when integrating with client systems, which can mess up a lot of things. Also, we did not want to have a dedicated and specialized team to manage Kafka clusters. With these requirements in mind, we moved to Confluent Cloud in 2017 to leverage their products and expertise. Our first Kafka use case was moving data from a monolithic 20TB database using CDC. We chose Confluent to move this data, using a CQRS architecture. Today this data powers our real-time analytics tool that’s used by our operations teams.
Moving to an event-driven architecture required many important decisions along the way. Here’s what we learned.
We try to avoid complexity in our systems as much as possible, and developing simple solutions is one of our core values. So, if something can be done in a synchronous request, we do it. For example, imagine a scenario where a user is trying to perform some action in an interface (web or mobile). We encourage people to make a synchronous request in this case, mainly to better handle failures. Instead of having a very complex architecture to deal with that failure or being exposed to a silent data bug, the user can simply retry by hitting a button again.
Synchronous requests are simple but we still need to be careful, so we follow these practices:
We try to use user-initiated requests as much as possible. This has strong implications for the user experience in the system and it helps to not get overwhelmed by automatic retries. It also makes the correlation between system behavior and look-and-feel stronger
If it is a request that hits several services, we don’t mix synchronous requests with asynchronous requests
Also, we avoid requests that are chained by several microservices (e.g. Service A calls Service B, Service B calls Service C, and so on) because they can result in cascade failures hurting reliability. As a rule of thumb, we try to avoid call chains deeper than two services
If we need to make requests that change data in several services, we should be able to roll back the changes, otherwise we will leave inconsistent data in one or more services
The order in our code is important. If the code must make a request to an external service and save something in the database in a transaction, we don’t save the data in the database before we are sure the request returned a successful response
Always apply backpressure techniques (timeouts, circuit breaks, etc.). It is important to avoid a system overload.
Instead of letting every team code their own Kafka producers and consumers and deal with the complexity, we developed a thin layer of infrastructure to abstract some of it. Since Kafka is a low-level piece of infrastructure, we can configure it in many ways and it requires some effort to understand which configurations we should use for a particular use case (e.g., should we use idempotent producers or transactional producers?). The main idea was that Kafka should be as invisible as possible to everyone in our engineering team.
One of the main benefits of the architecture we designed is that it’s been very simple for people to get on board. They don’t need to know much about Kafka to use this. We had real issues that we wanted to solve, and we designed this accordingly. That helps a lot with user adoption.
One of the main lessons learned was that we could have implemented the Event Broker microservice differently, given that it ends up being a single point of failure. Even though we never had any major incident with it, this is still a concern that floats around our heads from time to time. We could have implemented it in a more distributed way, for example, as a sidecar for each consumer service.
We also experienced some organizational trade-offs. Given that the Event Broker microservice was maintained by a single platform team, we ended up creating a strong dependency on this team for a lot of other engineering groups. If a team needed a new feature in this architecture, they had to wait until someone from the platform team became available to help them. The platform team built the new architecture and provides capabilities to all engineering teams—around 200 engineers.
Nowadays, we avoid this dependency. Instead of providing a single solution for everyone, we are concentrating our efforts on providing paved roads with a few paths that people can choose. We’re now allowing people to write events directly into Kafka if they choose, and let users consume events directly from the Kafka topics when it makes sense for them. We want to make sure there’s freedom for users to choose alternatives to encourage experimentation from the business side.
When you build something like this event-driven architecture, some team members will need to go deep. Someone will have to understand the internals and really understand how Kafka works to know what an error in a log message means, for example. Having access to support helps, but first, understand your system, how you built it, and how it will actually respond to issues. Documentation is also super important so that you’ll be able to see why and how decisions were made—why something was chosen over another option.
We also spent a lot of energy on observability and implemented logs and custom metrics so people can easily see what is happening under the hood with their events. It’s key to have good observability to see what’s happening under the hood and be able to debug your own problems.
We did not experience any major issues with any of the components of this architecture, and when it comes to performance, we had to implement some fine-tuning for a few high-volume events. Originally, we had one Kafka consumer for each consumer service, and for high-volume events, this could result in some competition for resources. Luckily it was an easy fix and we just had to configure dedicated Kafka consumers for these events.
There is a lot going on in this image—you can find a technical deep dive, complete with code samples, in our Medium post.
In short, the architecture includes these functions:
An event is created in a producer service. We provide a simple API, which is just a single function, to produce events that use schemas defined as Protocol Buffers in our shared protocol buffer repository. To avoid making unnecessary requests to other services to enrich event data, we also adopted the Event-Carried State Transfer concept in this design, which means that we add all information required by a consumer service in the event to ensure these types of requests don’t overload other services, especially during incident recoveries.
We also made sure this design has strong transactional capabilities as we did not want to worry about distributed transactions. Suppose we have a method in our producer service that saves some relational data in the database and also sends an event to Kafka. All of this should happen in the same transaction bit. If we fail to send the event to Kafka, we should roll back the changes we made in the relational database, which increases the complexity, so we wanted it to work like a simple database transaction that most people already know. To reach this goal, we’ve implemented the transactional Outbox Pattern.
Once an event is produced and reaches its designated Kafka topic, it will be consumed by a microservice called Event Broker as shown in the following image:
The Event Broker is responsible for reading events from Kafka and sending them to the consumers that need to receive them. The events are sent to the consumer via gRPC requests. We do not have any guarantees on the order of event deliveries. Since we are doing automatic retries, that makes in-order deliveries very difficult.
One of our goals was to make it as effortless as possible to add new events. In the end, we reached a point where an engineer only needs to do two things:
Add a configuration in the Event Broker telling it which consumer services need what events;
Implement a gRPC server to receive the event in the consumer service.
As for the gRPC server in the consumer service, its interface is also defined in our shared protocol buffer repository and each service that wants to listen to an event has to implement this interface in their gRPC servers.
We have a retry mechanism in place, which resends the event to the consumer service N times, where N is the value set in the Event Broker configuration that we mentioned before. We also have in-memory retries that use exponential backoffs and circuit breakers. We learned a lesson after past retry implementations DoS’ed some of our services to death.
This architecture has been running in production for more than two years now and we have dozens of types of events created, from package events to cargo transfer and accounting events. The Event Broker delivers millions of events each day to several consumer services.
In the end, the design was practical enough for our engineering team to engage and be productive; it’s easy to create and consume new events.
Other projects using this architecture include enabling delivery drivers to see live earnings in real time. They weren’t in the past able to see how much money they would receive for a specific delivery. Our integration and billing teams are trying out Kafka Streams or kSQLdb, which are well-suited to billing since finance has a time window constraint. People also use Kafka Streams to send data to clients about shipping status and other updates.
We want our teams to have the freedom to experiment and innovate without worrying about Kafka architecture.