This article was originally published on InfoWorld on Jan. 28, 2025
While large language models (LLMs) are useful, their real power emerges when they can act on insights, automating a broader range of tasks.
Reasoning agents have a long history in artificial intelligence (AI) research. A reasoning agent is a piece of software that can generalize from what it has previously seen and apply that to situations it hasn’t seen before. It’s like having a decision-making robot that can adapt based on what’s happening around it.
But the real excitement comes when reasoning agents work together in multi-agent systems.
Imagine assembling a dream team, where each member has a unique skill set but collaborates toward a shared goal. Multi-agent systems enable this kind of teamwork, relying on networks of agents that communicate, share context, and coordinate actions. These systems excel at solving complex challenges too big for any single agent—or person—to handle.
Of course, with great power comes great complexity.
Coordinating multiple agents presents challenges familiar to anyone who’s ever worked on a group project. There’s miscommunication, overlapping responsibilities, and difficulty aligning toward a common objective. Now, scale that to dozens—or hundreds—of autonomous agents, each acting independently but needing to stay in sync.
This article explores how event-driven design—a proven approach in microservices—can address the chaos, creating scalable, efficient multi-agent systems. If you’re leading teams toward the future of AI, understanding multi-agent design patterns is critical.
Let’s dive in.
Managing multi-agent systems introduces unique difficulties:
Context and data sharing: Agents must exchange information accurately and efficiently, avoiding duplication, loss, or misinterpretation.
Scalability and fault tolerance: As the number of agents grows, the system must handle complex interactions while recovering gracefully from failures.
Integration complexity: Agents often work with diverse systems and tools, requiring seamless interoperability.
Timely and accurate decisions: Agents need to make real-time decisions based on fresh, up-to-date data to ensure responsiveness and avoid poor outcomes.
Safety and validation: Guardrails are essential to prevent unintended actions, and stochastic outputs demand rigorous quality assurance.
Overcoming these challenges requires more than just thoughtful coordination—it calls for proven design patterns tailored for multi-agent systems. The next section dives into these patterns and demonstrates how they can be implemented using event-driven design to unlock scalable, reliable, and efficient multi-agent architectures.
Multi-agent design patterns define the interaction structures that enable agents to communicate, collaborate, or compete to solve problems. By focusing on the problem domain and the nature of agent interactions, these patterns offer solutions for coordinating autonomous entities in a range of scenarios.
The following sections explore four key patterns: orchestrator-worker, hierarchical agent, blackboard, and market-based. We show how each of these common multi-agent patterns is transformed into an event-driven distributed system, gaining the operational advantages of data streaming applications and removing the need for specialized communication paths for agent orchestration. We also describe the event-driven version of these patterns using conceptual models from Apache Kafka®. For anyone unfamiliar with Kafka, an accessible tour of its foundations can be found here.
In the orchestrator-worker pattern, a central orchestrator assigns tasks to worker agents and manages their execution. This pattern, similar to the master-worker pattern in distributed computing, ensures efficient task delegation and centralized coordination while allowing workers to focus on specific, independent tasks.
Using data streaming, you can adapt this pattern to make the agents event-driven. Data streaming technologies like Kafka offer key-based partitioning strategies, so the orchestrator can use keys to distribute command messages across partitions in a single topic. Worker agents can then act as a consumer group, pulling events from one or more assigned partitions to complete the work. Each worker agent then sends output messages into a second topic, where they can be consumed by downstream systems.
The pattern now looks like this:
While this diagram looks more complex, it dramatically simplifies the operations of the system.
The orchestrator no longer has to manage its connections to worker agents, including what happens if one dies or when worker agents are added or removed. Instead, it uses a keying strategy that distributes work across partitions. For events that should be processed by the same stateful worker agent as some previous message, the same key can be used for each event in the sequence. The worker agents gain the benefits of any consumer group.
The worker agents pull from one or more partitions, and the Kafka Consumer Rebalance Protocol redistributes partitions so that each worker has a similar workload even as worker agents are added or removed. In the event of a worker failure, the log can be replayed from a saved offset on a given partition. The orchestrator no longer needs bespoke logic for managing workers; instead, it simply specifies work and distributes it with a sensible keying strategy. Similarly, the worker agents inherit the functionality of a Kafka consumer group, so they can use common machinery for coordination, scaling, and fault recovery.
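The keying strategy can be illustrated with a small in-memory model. This is a sketch only, not Kafka client code: `partition_for`, `assign_partitions`, and `route` are hypothetical helpers, CRC32 stands in for whatever hash a real client partitioner uses, and a round-robin assignment stands in for the rebalance protocol.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 6  # hypothetical partition count for the command topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash.

    CRC32 is only a stand-in for a real partitioner. The property that
    matters is that equal keys always land on the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(workers: list[str], num_partitions: int = NUM_PARTITIONS) -> dict[int, str]:
    """Spread partitions round-robin over workers (stand-in for a rebalance)."""
    ordered = sorted(workers)
    return {p: ordered[p % len(ordered)] for p in range(num_partitions)}


def route(commands: list[tuple[str, str]], workers: list[str]) -> dict[str, list[str]]:
    """Return the payloads each worker would receive, in arrival order."""
    owner = assign_partitions(workers)
    inbox: dict[str, list[str]] = defaultdict(list)
    for key, payload in commands:
        inbox[owner[partition_for(key)]].append(payload)
    return dict(inbox)


# (key, payload) pairs: events for one session share a key.
commands = [
    ("session-a", "a:1"),
    ("session-b", "b:1"),
    ("session-a", "a:2"),
    ("session-c", "c:1"),
    ("session-a", "a:3"),
]

print(route(commands, ["worker-1", "worker-2"]))
# Adding a worker changes the assignment, not the orchestrator's code.
print(route(commands, ["worker-1", "worker-2", "worker-3"]))
```

In both runs, every `session-a` event reaches a single worker in its original order, which is the guarantee a stateful worker agent relies on. The orchestrator only chooses keys; it never addresses a worker directly.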
In the hierarchical agent pattern, agents are organized into layers, where higher-level agents oversee or delegate tasks to lower-level agents. It’s particularly effective for managing large, complex problems by breaking them into smaller, more manageable parts.
To make the hierarchical pattern event-driven, you apply the orchestrator-worker techniques for decomposing work recursively through the agent hierarchy, so that each non-leaf node is the orchestrator for its own subtree.
In the example above, Mid-Level Agent #1 is itself an orchestrator for its leaf agents. Its entire workflow is functionally encapsulated into its role as a worker orchestrated by the Top-Level Agent.
The hierarchical topology depicted in the previous diagram now looks like the image below:
For the event model, note that topics are logical swimlanes for agent-specific functional workloads, so siblings in the tree structure will form consumer groups processing the same topics as depicted above. By making the hierarchical organization event-driven, you make the system asynchronous, greatly simplifying the conceptual model for data flow. Your operations are more resilient as the topology is no longer hardcoded: agents can be added or removed from sibling groups without the individual agents having to manage this change or faults in the communication paths.
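The recursive decomposition can be sketched with one queue per level of the tree. Everything here is an illustrative assumption: the topic names, the `orchestrate` and `work` helpers, and the string-splitting "tasks" are hypothetical, and the in-memory deques are drained on read, unlike a Kafka log with offsets and consumer groups.

```python
from collections import defaultdict, deque

# topic name -> queued events (an in-memory stand-in for topics)
topics: dict[str, deque] = defaultdict(deque)


def publish(topic: str, event: dict) -> None:
    topics[topic].append(event)


def orchestrate(in_topic: str, out_topic: str, split) -> None:
    """Non-leaf agent: consume each task and emit subtasks for its subtree."""
    while topics[in_topic]:
        task = topics[in_topic].popleft()
        for sub in split(task["work"]):
            publish(out_topic, {"root": task["root"], "work": sub})


def work(in_topic: str, out_topic: str) -> None:
    """Leaf agent: consume subtasks and emit results."""
    while topics[in_topic]:
        task = topics[in_topic].popleft()
        publish(out_topic, {"root": task["root"], "result": task["work"].upper()})


publish("top.tasks", {"root": "job-1", "work": "plan trip;book hotel"})

# The top-level agent and the mid-level agent run the same orchestrator logic;
# only the topics and the splitting rule differ.
orchestrate("top.tasks", "mid.tasks", lambda w: w.split(";"))
orchestrate("mid.tasks", "leaf.tasks", lambda w: w.split())
work("leaf.tasks", "results")

results = [event["result"] for event in topics["results"]]
print(results)
```

Each level only knows the topic it reads from and the topic it writes to, so a level can gain or lose agents without its parent or children changing.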
The blackboard pattern provides a shared knowledge base—a "blackboard"—that agents use to post and retrieve information. This pattern enables agents to collaborate asynchronously without direct communication. It is especially useful for solving complex problems requiring incremental, multi-agent contributions.
You can adapt this pattern to be event-driven in a straightforward way.
The blackboard becomes a data streaming topic consisting of messages produced and consumed by the worker agents. If needed, a keying strategy or payload fields can be used to annotate which agent originated the event.
The event-driven version looks like this:
Again, this creates a significant operational simplification and reduces the amount of bespoke logic that must be created outside of the infrastructure. Each worker agent simply produces and consumes events in order to collaborate with the rest of the group.
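A minimal sketch of a blackboard as a single shared topic follows. The `Blackboard` and `Agent` classes, the event kinds, and the three example agents are all hypothetical; an append-only list stands in for the topic, and each agent tracks its own read position in place of a committed offset.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Append-only list standing in for one shared topic."""
    events: list = field(default_factory=list)

    def post(self, agent: str, kind: str, value) -> None:
        # The originating agent is recorded in the payload.
        self.events.append({"agent": agent, "kind": kind, "value": value})

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


class Agent:
    def __init__(self, name: str, reacts_to: str, produces: str, fn):
        self.name, self.reacts_to, self.produces, self.fn = name, reacts_to, produces, fn
        self.offset = 0  # this agent's own position on the blackboard

    def step(self, board: Blackboard) -> bool:
        """Read unseen events and post a contribution for each relevant one."""
        new = board.read_from(self.offset)
        self.offset += len(new)
        posted = False
        for event in new:
            if event["kind"] == self.reacts_to:
                board.post(self.name, self.produces, self.fn(event["value"]))
                posted = True
        return posted


board = Blackboard()
agents = [
    Agent("reporter", "sum", "answer", lambda s: f"total={s}"),
    Agent("adder", "numbers", "sum", sum),
    Agent("parser", "text", "numbers", lambda t: [int(x) for x in t.split()]),
]

board.post("user", "text", "3 4 5")
# Keep stepping every agent until a full round posts nothing new.
while any([agent.step(board) for agent in agents]):
    pass

for event in board.events:
    print(event)
```

The agents never call one another. Each contributes when it sees an event kind it understands, so the order in which agents are listed does not change the final result.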
The market-based pattern models a decentralized marketplace where agents negotiate and compete to allocate tasks or resources.
For example, solver or bidding agents can exchange responses with each other to refine their responses. This process is repeated for a fixed number of rounds, after which a final answer is compiled by an aggregator agent based on the final responses from all agents.
Financial services have long used data streaming platforms as systems of record for the world’s largest stock exchanges. Data streaming systems like Kafka and Confluent even run many high-throughput over-the-counter securities markets. Such a market is commonly implemented with a topic for bids and another for asks, to which each solver agent publishes events. A simple market maker service creates transactions where bids and asks are matched and publishes notifications of these events to a third topic that the solver agents consume.
This is an important simplification as it eliminates the quadratic connections that otherwise occur between the solver agents, which are difficult to manage in the presence of many agents or as agents are added or lost.
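The bid, ask, and notification topics described above can be sketched as three lists with a simple matching function. The `Order` type, the agent names, and the matching rule (each bid takes the cheapest open ask for the same task at or below its price) are assumptions made for illustration, not a description of any real exchange.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    agent: str
    task: str
    price: int


def match(bids: list[Order], asks: list[Order]) -> list[dict]:
    """Market maker: pair each bid with the cheapest compatible open ask."""
    open_asks = sorted(asks, key=lambda a: a.price)
    fills = []
    for bid in bids:
        for ask in open_asks:
            if ask.task == bid.task and ask.price <= bid.price:
                fills.append({
                    "task": bid.task,
                    "buyer": bid.agent,
                    "seller": ask.agent,
                    "price": ask.price,
                })
                open_asks.remove(ask)  # an ask can be filled only once
                break
    return fills


# Three lists stand in for the bids, asks, and notifications topics.
bids_topic = [
    Order("planner", "summarize", 10),
    Order("router", "translate", 4),
]
asks_topic = [
    Order("solver-1", "summarize", 8),
    Order("solver-2", "summarize", 6),
    Order("solver-3", "translate", 5),
]
fills_topic = match(bids_topic, asks_topic)

print(fills_topic)
```

With n agents, direct negotiation can need up to n(n-1)/2 pairwise links, while here each agent needs only its connection to the streaming platform.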
The pattern now looks like this:
In making each of these patterns event-driven, we’ve operated under the premise that agents are driven by events. Let’s dig into that a bit further next.
The outlined design patterns depend on a shared operating model for seamless agent coordination, similar to microservices.
At the core of this model is a shared language—a way for agents to exchange information, maintain alignment, and collaborate efficiently. Events serve as this language, acting as structured updates that enable agents to interpret instructions, share context, and coordinate tasks. Think of it as the system’s group chat: keeping agents synchronized and integrating new ones smoothly.
Here’s what this shared language enables:
Interpret commands: Agents receive clear, standardized instructions, like JSON payloads, guiding their actions.
Share context: Agents broadcast updates consistently, avoiding duplication and ensuring mutual understanding.
Coordinate tasks: Agents perform independent actions aligned toward shared objectives, even in dynamic or unpredictable environments.
This is where interfaces play a critical role. Agents must be designed to react to events and commands rather than act in isolation, ensuring they integrate seamlessly into a larger, event-driven ecosystem.
A critical, and liberating, simplifying assumption is that these agents don’t divine action; rather, they react to upstream events or commands. Because they operate within dynamic, interconnected environments, agents can be modeled with three components:
Input: Consuming events or commands.
Processing: Applying reasoning or gathering additional data.
Output: Emitting actions for downstream consumers.
This reactive design mirrors microservices, enabling the use of proven design patterns for scalable, efficient system development.
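The three components can be written down as a plain function from one input event to zero or more output events. This is a sketch under stated assumptions: `run_agent`, `triage`, and the event types are hypothetical, and the keyword check stands in for whatever reasoning step (for example, an LLM call) a real agent would perform.

```python
from typing import Callable, Iterable

Event = dict
Handler = Callable[[Event], Iterable[Event]]


def run_agent(inbox: Iterable[Event], handler: Handler) -> list[Event]:
    """Input: consume events. Processing: call the handler. Output: collect emissions."""
    outbox: list[Event] = []
    for event in inbox:
        outbox.extend(handler(event))
    return outbox


def triage(event: Event) -> list[Event]:
    """Hypothetical processing step for a support-ticket agent."""
    if event.get("type") != "ticket.created":
        return []  # ignore events this agent does not handle
    urgent = "outage" in event["text"].lower()
    return [{
        "type": "ticket.triaged",
        "id": event["id"],
        "priority": "high" if urgent else "normal",
    }]


inbox = [
    {"type": "ticket.created", "id": 1, "text": "Outage in region A"},
    {"type": "user.login", "id": 2},
    {"type": "ticket.created", "id": 3, "text": "How do I reset my password?"},
]

for event in run_agent(inbox, triage):
    print(event)
```

Because the agent is only a handler between an input stream and an output stream, it can be tested with a list of events and later attached to real topics without changing its logic.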
The parallel with event-driven microservices applies here too. Traditionally, parts of a system interact through a request/response model. While straightforward, this approach struggles with scalability and real-time responsiveness, introducing delays and bottlenecks as systems grow. It’s akin to needing permission for every action, which slows down operations.
The evolution toward an event-driven architecture marks a pivotal shift.
In this model, agents are designed to emit and listen for events autonomously. Events act as signals that something has happened, allowing agents to respond without requiring direct, orchestrated requests. This approach ensures agility, scalability, and a more dynamic system.
Agent interfaces in event-driven systems are defined by the events they emit and consume, encapsulated in simple, standardized messages like JSON payloads. This structured design:
Simplifies how agents understand and react to events.
Promotes reusability of agents across different workflows and systems.
Enables seamless integration into dynamic, evolving environments.
For example, a health monitoring agent can emit alerts when thresholds are breached, effortlessly integrating into workflows without custom dependencies.
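As a sketch of such an interface, the agent below is defined entirely by the JSON message it consumes and the JSON message it emits. The field names, event types, and threshold values are hypothetical choices for illustration; the source does not specify a schema.

```python
import json

# Hypothetical alert thresholds per metric.
THRESHOLDS = {"cpu_percent": 90, "disk_percent": 95}


def handle(message: str) -> list[str]:
    """Consume one JSON reading and emit zero or one JSON alert."""
    reading = json.loads(message)
    if reading.get("type") != "metric.reading":
        return []  # not part of this agent's input contract
    limit = THRESHOLDS.get(reading["metric"])
    if limit is None or reading["value"] <= limit:
        return []  # unknown metric or threshold not breached
    alert = {
        "type": "health.alert",
        "source": reading["source"],
        "metric": reading["metric"],
        "value": reading["value"],
        "threshold": limit,
    }
    return [json.dumps(alert, sort_keys=True)]


incoming = [
    json.dumps({"type": "metric.reading", "source": "node-7", "metric": "cpu_percent", "value": 97}),
    json.dumps({"type": "metric.reading", "source": "node-7", "metric": "cpu_percent", "value": 42}),
]

for message in incoming:
    for alert in handle(message):
        print(alert)
```

Any workflow that understands the `health.alert` message can consume this agent's output, and the agent needs no knowledge of who is listening.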
For a distributed system to function harmoniously, maintaining a consistent state across all agents is critical. This is where the concept of an immutable log comes into play. Every event or command an agent processes is recorded in a log that is permanent and unchangeable. Acting as a single source of truth, the log ensures all agents operate with the same context, enabling:
Reliable coordination and synchronization.
Resilience through replayable events, allowing recovery from failures.
Sophisticated consumer models, where multiple agents can respond to the same event without confusion or overlap.
This approach dramatically improves system reliability, ensuring that agents work cohesively to achieve shared goals, even in complex or unpredictable environments.
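The replay and multi-consumer properties can be sketched with an append-only list and one offset per consumer. The `Log` class and its `append`, `poll`, and `seek` methods are hypothetical names for an in-memory illustration, not a real client API.

```python
class Log:
    """Append-only event log with an independent offset per consumer."""

    def __init__(self) -> None:
        self._events: list[dict] = []
        self._offsets: dict[str, int] = {}

    def append(self, event: dict) -> int:
        """Record an event and return its offset."""
        self._events.append(dict(event))  # copy so callers cannot alter history
        return len(self._events) - 1

    def poll(self, consumer: str) -> list[dict]:
        """Return events this consumer has not seen yet and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = [dict(e) for e in self._events[start:]]
        self._offsets[consumer] = len(self._events)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Move a consumer back (or forward) to replay from a saved offset."""
        self._offsets[consumer] = offset


log = Log()
log.append({"type": "order.created", "id": 1})
log.append({"type": "order.paid", "id": 1})

# Two agents read the same events without interfering with each other.
billing_first = log.poll("billing-agent")
audit_first = log.poll("audit-agent")

# After a failure, the billing agent replays from the start of the log.
log.seek("billing-agent", 0)
billing_replay = log.poll("billing-agent")

print(billing_first == audit_first == billing_replay)
```

Because reading never removes or changes an event, a recovering agent sees exactly the history every other agent saw.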
Multi-agent systems are redefining what’s possible in AI. But to realize their full potential, we must overcome challenges like scalability, fault tolerance, and real-time decision-making.
Event-driven design offers a clear path forward.
As AI applications grow more sophisticated, event-driven multi-agent systems will be crucial for tackling real-world complexity. By adopting this model and standardizing communication between agents, we create a foundation that is resilient, efficient, and adaptable to changing demands, unlocking the full potential of these architectures.
Apache®, Apache Kafka®, and Kafka® are registered trademarks of the Apache Software Foundation.