Build Services on a Backbone of Events

Ben Stopford

For many, microservices are built on a protocol of requests and responses: REST, for example. This approach feels natural. It is, after all, the way we write programs: we make calls to other code modules, await a response and continue. It also fits closely with a lot of the use cases we see each day: front-facing websites where users hit buttons and expect things to happen.

But when we step into a world of many independent services, things start to change. As the number of services grows gradually over time, the web of synchronous interactions grows with them. Previously benign availability issues start to trigger far more widespread outages.

Our unfortunate Ops engineers end up donning deerstalkers as they play out distributed murder mysteries: frantically running from service to service, piecing together snippets of second-hand information (who said what, to whom, and when?).

This is a well-known problem and there are a number of solutions. One is to ensure your individual services have significantly higher SLAs than your system; Google provides an approach for doing this. The alternative is to simply break down the synchronous ties which bind services together.

We can use asynchronicity as a mechanism for doing this. If you’re working in online retail you’ll find synchronous interfaces like getImage() or processOrder() feel natural. Calls which expect an immediate response. But when a user clicks ‘Buy’ they trigger a complex and asynchronous process. A process that takes a purchase and physically delivers it to the user’s door, way beyond the context of the original button-click. So splitting software into asynchronous flows allows us to compartmentalise the different problems we need to solve, as well as allowing us to embrace a world that is itself, inherently asynchronous.

In practice we tend to embrace this automatically. We’ve all found ourselves polling database tables for changes, or implementing some kind of scheduled cron job to churn through updates. These are simple ways to break the ties of synchronicity, but they always feel like a bit of a hack. There is a good reason for this. They probably are.

So we can condense all these issues into a single observation. The imperative programming model, where we command services to do our bidding, isn’t a great fit for estates where services are operated independently.

In this post we’re going to look at the other side of the architecture coin: composing services not through chains of commands, but rather through streams of events. This is a valid approach in its own right. It also forms a baseline for the more advanced patterns we’ll be discussing later in this series, where we blend the ideas of event-driven processing with those seen in Streaming Platforms.

Commands, Events and Queries

Before we dive into an example, we need to fix three simple concepts. Services can interact with one another in three ways: Commands, Events and Queries. If you’ve not considered the distinction between these before, it’s well worth doing so.

The beauty of events is that they are both fact and trigger: data-on-the-outside that can be reused by any service in the system. But from a service’s perspective, Events lead to less coupling than Commands and Queries. This fact is important.

The three mechanisms through which services interact:
  1. Commands are an action. A request for some operation to be performed in another service. Something that will change the state of the system. Commands expect a response.
  2. Events are both a Fact and a Trigger. Something that has happened, expressed as a notification.
  3. Queries are a request to look something up. Importantly, queries are side effect free; they leave the state of the system unchanged.
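The distinction can be sketched as three message shapes. This is an illustrative model with hypothetical names, not a prescribed API:

```python
# A minimal sketch of the three interaction styles as message types.
# Commands and queries expect a reply; events are one-way facts.
from dataclasses import dataclass, field
import time

@dataclass
class Command:            # "do something": changes state, expects a response
    name: str             # e.g. "processPayment"
    payload: dict

@dataclass
class Event:              # "something happened": a fact, no response expected
    name: str             # e.g. "OrderRequested"
    payload: dict
    timestamp: float = field(default_factory=time.time)

@dataclass
class Query:              # "look something up": side-effect free
    name: str             # e.g. "getStockCount"
    params: dict
```

The key structural difference is direction and expectation: a `Command` and a `Query` imply a caller waiting on a callee, while an `Event` carries no addressee at all.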


A Simple Event Driven Flow

Let’s start with a simple example: a customer ordering a widget. Two things happen next:

  1. The associated payment is processed.
  2. The system checks to see if more widgets need to be ordered.

In the request-driven approach this can be represented as a chain of commands. There are currently no queries. The interaction would look like this:

The first thing to note is that the business process for “Buying more Stock” is initiated by (called by) the Orders Service. This mingles responsibilities across the two services. Ideally we’d have a better separation of concerns.

Now, if we represent the same flow using an event-driven approach, things work a little better:

  1. The UI Service raises OrderRequested and awaits OrderConfirmed (or Rejected) before returning to the user.
  2. Both the Orders Service and the Stock Service react to the raised event.

Looking at this closely, the interaction between the UI Service and the Orders Service hasn’t changed all that much, other than that they now communicate via events rather than calling one another directly.

The Stock Service is interesting though. No longer does the Orders Service tell it what to do. The Stock Service itself controls whether it partakes in the interaction or not. This is a very important property of this type of architecture, termed Receiver-Driven Routing: logic is pushed to the receiver of the events, rather than the sender. The burden of responsibility is flipped!

Flipping control to receivers reduces the coupling between services, which opens up an important level of pluggability to the architecture. Components can be swapped in and out with ease.
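As a toy illustration of receiver-driven routing, the sketch below uses a minimal in-process event bus (hypothetical names, not a real broker API). The sender publishes OrderRequested without knowing who is listening; the Orders and Stock handlers each choose to subscribe:

```python
from collections import defaultdict

class EventBus:
    """Toy in-process broker: receivers decide what they subscribe to."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self.subscribers[event_name].append(handler)

    def publish(self, event_name, payload):
        # The publisher never addresses a specific service.
        for handler in self.subscribers[event_name]:
            handler(payload)

bus = EventBus()
log = []

# Orders and Stock each opt in to OrderRequested; the UI never calls them.
bus.subscribe("OrderRequested", lambda o: log.append(("orders", o["id"])))
bus.subscribe("OrderRequested", lambda o: log.append(("stock", o["id"])))

bus.publish("OrderRequested", {"id": 42, "product": "widget"})
```

Adding a pricing service here would be a one-line `subscribe` call: nothing in the publisher changes, which is exactly the pluggability described above.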

This element of pluggability becomes increasingly important as architectures become more complex. Say we were to add a service that manages pricing in real time, tweaking a product’s price based on supply and demand. In a command-driven world we would need to introduce a maybeUpdatePrice() method which is called by both the Stock Service and the Orders Service. But in the event-driven world re-pricing is just a service that subscribes to the shared stream, sending out price updates when relevant criteria are met.

O’Reilly Book: Designing Event Driven Systems
Explore all these concepts in detail with the free O’Reilly book “Designing Event-Driven Systems: Concepts and Patterns for Streaming Services with Apache Kafka”.

Blending Events and Queries

The above example considered only commands and events; there were no queries (recall that we defined all interactions as being one of commands, events and queries earlier). Queries are a necessity for all but the simplest architectures. So let’s extend the example a bit, making the Orders Service check that there is sufficient stock before it processes a payment.

The request-driven approach to this would involve sending a query to the Stock service to retrieve the current stock count. This leads to a hybrid model, where the event stream is used purely for notification, allowing any service to dip into the flow, but queries go directly to source.

For larger ecosystems, where services need to evolve independently, remote queries add a lot of coupling, tying services together at runtime. We can avoid such cross-context queries by internalising them. The stream of events is used to cache datasets in each service, where they can be queried locally.

So to add this stock check, the Orders Service would subscribe to the stream of Stock events, storing them locally in a database. It would then query this ‘view’ to validate there is sufficient stock.
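A minimal sketch of this local ‘view’, assuming hypothetical StockAdded/StockRemoved event shapes rather than any real schema: the Orders Service folds each Stock event into a local table and answers the stock check without a remote call.

```python
class StockView:
    """Local view of stock, built by consuming Stock events rather than
    querying the Stock Service remotely (event-carried state transfer)."""
    def __init__(self):
        self.counts = {}          # product -> quantity, the local cache

    def apply(self, event):
        """Fold one event from the Stock stream into the local view."""
        product, qty = event["product"], event["qty"]
        if event["type"] == "StockAdded":
            self.counts[product] = self.counts.get(product, 0) + qty
        elif event["type"] == "StockRemoved":
            self.counts[product] = self.counts.get(product, 0) - qty

    def sufficient(self, product, qty):
        """The local 'query': no cross-service call involved."""
        return self.counts.get(product, 0) >= qty

view = StockView()
for e in [{"type": "StockAdded", "product": "widget", "qty": 10},
          {"type": "StockRemoved", "product": "widget", "qty": 3}]:
    view.apply(e)

ok = view.sufficient("widget", 5)   # validated locally: 7 widgets remain
```

In a real system the event stream would come from a broker topic and the view would live in a database or state store; the point of the sketch is only that the query never leaves the service.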

Pure event-driven systems have no concept of remote queries – events propagate state to services where it is queried locally

There are three advantages of this ‘Query by Event-Carried State Transfer’ approach:

  1. Better decoupling: Queries are local. They involve no cross-context calls. This ties services far less tightly than their request-driven brethren.
  2. Better autonomy: The Orders service has a private copy of the Stock dataset so it can do whatever it likes with it, rather than being limited to the query functionality offered by the Stock Service.
  3. Efficient Joins: If we were to “look up the stock” on every order, we would effectively be doing a join over the network between the two services. As workloads grow, or more sources need to be combined, this can be increasingly arduous. ‘Query by Event Carried State Transfer’ solves this issue by bringing queries (and joins) local.

This approach isn’t without its downsides though. Services become inherently stateful. They need to keep track of, and curate, the propagated data set over time. The duplication of state can also make some problems harder to reason about (how do we decrement the stock count atomically?) and we should be careful of divergence over time. But all these issues have workable solutions, they just need a little consideration. This is well worth the effort for larger, more complex estates. 

The Single Writer Principle

A useful principle to apply with this style of system is to assign responsibility for propagating events of a specific type to a single service: a single writer. So the Stock Service would own how the ‘Inventory of Stock’ progresses forward over time, the Orders Service would own Orders, and so on.

This helps funnel consistency, validation and other ‘write path’ concerns through a single code path (although not necessarily a single process). So in the example below, notice that the Order Service takes control of every state change made to an Order, but the whole event flow spans Orders, Payments and Shipments, each managed by their respective services.

Assigning responsibility for event propagation is important because these aren’t just ephemeral events, or transient chit-chat. They represent shared facts, the data-on-the-outside. As such, services need to take responsibility for curating these shared datasets over time: fixing errors, handling situations where schemas change etc.

Here each color represents a Kafka topic, one each for Orders, Shipments and Payments. The Basket Service starts the ball rolling: when a user clicks ‘buy’ it raises Order Requested, waiting for the Order Confirmed event before it responds back to the user. The three other services handle the state transitions that pertain to their section of the workflow. So, for example, after the payment is processed, the Orders Service pushes the Order from Validated to Confirmed.
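As a rough illustration of the single writer principle (hypothetical names and states), only the Orders Service below is allowed to move an Order through its lifecycle; events owned by other services merely trigger it to do so:

```python
class OrdersService:
    """Single writer for Order state: every state change to an Order goes
    through this one code path, wherever the triggering event came from."""

    def __init__(self):
        self.orders = {}            # order_id -> current state

    def on_order_requested(self, order_id):
        # Validation logic would live here; on success the order is Validated.
        self.orders[order_id] = "Validated"

    def on_payment_processed(self, order_id):
        # Triggered by the Payments Service's event, but the write to Order
        # state is still performed here, by the single writer.
        assert self.orders[order_id] == "Validated"
        self.orders[order_id] = "Confirmed"

svc = OrdersService()
svc.on_order_requested("order-1")    # Order Requested -> Validated
svc.on_payment_processed("order-1")  # Payment Processed -> Confirmed
```

The Payments and Shipments services would have analogous single-writer code paths for their own entities; no service ever writes another service’s state directly.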

Blending Patterns and Clustering Services

For some the pattern described above will look like Enterprise Messaging, but it is subtly different. Enterprise Messaging, in practice, focuses on state transfer, effectively tying databases together across a network.

Event Collaboration is about services progressing some business goal, through a cascade of events, which trigger services into action. So it’s a pattern for business processing, rather than simply a mechanism for moving state.

But we typically want to leverage both “faces” of this pattern in the systems we build. In fact, one of the beauties of this pattern is it can handle both the micro and the macro, or be hybridized where it makes sense.

Combining patterns is also quite common. We might want the flexibility of remote queries rather than the overhead of maintaining a dataset locally, particularly as datasets grow. Remote queries also make it easier to deploy simple functions (which is important if we want to compose lightweight, serverless-styled event flows), or to run in a stateless container or browser.

The trick is to limit the scope of these query interfaces, ideally to within a bounded context*. It’s typically better to have an architecture with many specific, targeted views rather than a single, shared datastore. (*A bounded context, here, is a set of services which share the same deployment cycle or domain model.)

To limit the scope of remote queries we can use a clustered context pattern. Here event flows are the sole communication pattern between contexts. But services within a context leverage both event-driven processing and request-driven views, where they are needed.

In the example below we have three divisions, which communicate with one another only through events. Inside each one we use finer grained event-driven flows. Some of these include a view tier (query layer). This balances coupling with convenience, allowing us to blend fine-grained services with the larger entities; the legacy applications or off the shelf products, which inhabit many real world service estates.

Clustered Context Model

The five key advantages of event-driven services:
  1. Decoupling: Break long chains of blocking commands. Decompose synchronous workflows. Brokers decouple services so it’s easier to plug new ones in, or change those that are there. 
  2. Offline/Async flows: When the user clicks a button many things happen. Some synchronously, some asynchronously. Design for the latter and the former falls out for free. 
  3. State Transfer: Events become the dataset of your system. Streams provide an efficient mechanism for distributing datasets, so they can be reconstituted, and queried, inside a bounded context.
  4. Joins: It’s easier to combine/join/augment datasets from different services. Joins are fast and local.
  5. Traceability: It’s easier to debug the distributed ‘murder mystery’ when there’s a central, immutable, retentive narrative journaling each interaction as it unfolds in time.

Summing Up

So in the event driven approach we use Events instead of Commands. Events trigger processing. They are also turned into views we can query locally. We revert back to remote, synchronous queries where necessary, particularly in smaller ecosystems, but we limit their scope in larger ones (ideally to a single bounded context). 

But all these approaches are just patterns. Guiding principles for sewing systems together. We shouldn’t be too dogmatic about them. For example a global query service can still be a good idea, if it is something that rarely changes (like a Single Sign On service).

The trick is to start with a baseline of events. Events provide far less opportunity for services to couple themselves to one another, and flipping flow-control to the receiver makes for better separated concerns and better pluggability.

The other interesting thing about event-driven approaches is that they work as well for large, complex architectures as they do for small, highly collaborative ones. A backbone of events gives services the autonomy they need to evolve freely, free of the complex ties of commands and queries. As for the Ops engineers, they’ll still be wearing deerstalkers and playing murder mystery, but hopefully not quite as often, and at least now the story comes with a script!

But with all this talk of events, we’ve talked little of distributed logs or stream processing. When we apply this pattern with Kafka the rules of the system change. Retention in the broker becomes a tool we can design for, allowing us to embrace data-on-the-outside. Something services can refer back to.

Streaming Platforms sit very naturally with this model of processing events and building views. Views embedded directly inside a service, views a service queries remotely, or views materialised as an on-going stream.

This leads to a whole host of optimisations: leveraging the duality between event stream and event store, blending in stream processing tools that sift through the narrative, joining streams from many services and materializing views we can query. These patterns are empowering. They allow us to reimagine business processing using a toolset specifically designed for handling streams of events. But all these optimisations build on the concepts discussed here, simply applied with a more contemporary toolset.

In the next post, we will look at how we can use the various features Kafka offers to build scalable, highly-available event driven services.

Thanks to Antony Stubbs, Tim Berglund, Kaufman Ng, Gwen Shapira and Jay Kreps for their help reviewing this post.

Posts in this Series:

Part 1: The Data Dichotomy: Rethinking the Way We Treat Data and Services
Part 2: Build Services on a Backbone of Events
Part 3: Using Apache Kafka as a Scalable, Event-Driven Backbone for Service Architectures (Read Next)
Part 4: Chain Services with Exactly Once Guarantees
Part 5: Messaging as the Single Source of Truth
Part 6: Leveraging the Power of a Database Unbundled
Part 7: Building a Microservices Ecosystem with Kafka Streams and KSQL






  1. Thanks for the thorough post. It’s interesting to be in this transition period between HTTP/REST and event-driven service communication. While I agree that event-based communication between services is ideal for fire-and-forget interactions, re-purposing event streams for query-type communication is still not ideal, especially where systems connect to the wider web, which is an ecosystem built around synchronous communication. Protocols like WebSocket may change this, but the RPC protocols over WebSocket are ad-hoc, and widespread tooling (cURL) and standards (OAuth) are not available. Building local views for querying is often not feasible, especially for more complex systems like search. Developers working on systems whose sole interprocess communication is via event streams have to build special tooling and systems to efficiently debug issues, or use GUI tools to post/get messages.

    So while I think event streams are a key part of a resilient systems architecture, I would resist seeing all microservice communication as the nail for that particular hammer.

    1. Hi Baq
      Thanks for leaving a comment. I certainly agree that repurposing event streams for queries is not a great idea, and it isn’t something we’re suggesting here, at least not in the sense of ‘synthesizing’ request-response over Kafka. But the event collaboration pattern provides a useful alternative nonetheless.
      Typically a user interface might communicate with its webserver or other services via REST or WebSockets or whatever feels most natural. But the webserver or other services would rely on views they create, if they use data from other backend services. These views would live in a database of some form, whichever best suits the use case. This is a pretty common pattern in the wild.
      I would propose that having all your interaction go through Kafka makes it easier to debug issues. There is a journal to refer back to. As the tooling in this area becomes richer I think this will become an increasingly valuable asset, but there remains more work to be done.
      But what is important to note, and maybe I should have stressed more strongly here, is that this is a pattern for a service architecture: specifically, services that are independently deployable, run by different teams, etc. There are many use cases where a (distributed) monolith is a far better pattern to use. In such cases it’s easier for services to share datastores, so the benefit of event-based communication is to provide a trigger for async processing, rather than also being a mechanism for moving state between services.

  2. Hi Ben
    thanks for the great blog. I think, instead of building its own local data with stock counts, the Orders Service could listen to an ItemsReserved event propagated by the Stock Service (after processing the OrderConfirmed event). This would solve the atomic counting of items in stock. But let’s say you have to send an email notification to the user. This service would need some data from different services (user info, email, ordered items with some details). Without local data, the email service would need to query other services, or the OrderApproved event which triggers the email notification would have to contain all the necessary data (collected over the whole process). So, with a local store we can not only decouple services, but also keep events really simple.


    1. Hi Tom

      Yes, that should work. There are a few options that would. For the email service also, it’s a boon. I’ll clarify the post a bit, although I don’t want to delve into too much detail. Good comment though. Thanks.


  3. One problem that gets overlooked most often is how important a typed, versioned interface is between microservices; Protocol Buffers for messaging would be an answer here. Also, gRPC has async streams, which is like the best of both worlds. Force-fitting all scenarios into one solution domain looks counterproductive. The idea could be to use both request-response and event-driven styles, with typed, versioned interfaces.

    1. Hi Alex

      Contracts are most certainly important, particularly between teams/deployment-units, but also between individual services. I was in two minds about discussing the subject in this post. One consideration that is important, I think, is that it’s the schema of the async messages that matters most for this particular pattern. Why? Because it’s the default way for different teams/deployment-units to communicate. So we use both async/brokered and request-response protocols, but the scope of the latter is limited, ideally to a single deployment unit. So if you have a UI talking to a couple of backend services in the same deployment unit, REST/JSON is often enough. But you want something that provides a stronger contract, particularly evolvability features, between teams/deployment-units.

      Regarding gRPC, I see that as a slightly different beast. It’s the broker which really adds the value here from a decoupling perspective (so more than just relieving backpressure). In fairness, the pattern described here (Event Collaboration) can be performed with any async technology, but the decoupling benefits I mention only fully apply to brokered, async tech. Later in the series we get into more Kafka-specific patterns, which do a better job (I hope) of elucidating the value of a broker that has good scalability and retention properties.


  4. Hi Ben,

    When you talk about asynchronous request/reply over Kafka, for example “OrderRequested” and “OrderConfirmed”: how do you implement that with Kafka? In other message brokers you have headers or temporary queues/topics to cover that …


    1. Hi Fernando

      Yes, Kafka doesn’t provide message headers yet, although they are likely to be added soon, so you’d deserialise the message. Filter functionality is provided by the Streams API. You can run this as a separate component, but for a use case like this you’d just do the filtering in the client service, either with Streams or by rolling your own. So the Orders Service subscribes to the stream of orders, reacting only to OrderConfirmed.


  5. Thanks Ben, another great article!
    The principles all resonate with me – the event-only model has deep consistency and symmetry.
    The killer features for me are:
    * complete system replay, with no loss of fidelity and no audit black-holes (“oops, I forgot to log that step …”)
    * naturally scalable

    One thought:
    The choice of toolkit really helps (e.g. async JAX-RS is perhaps not the best choice, but Vert.x has been perfect for interfacing with Kafka).

    1. Hi Fuzz – yeah, replay is super useful. Event sourcing for the data you share 🙂 There is a post on just that on its way, in fact (it’s the next but one). Haven’t tried Vert.x – will have to check it out.

  6. A red flag for me with ‘Query by Event Propagation’ is the duplication of data in a service that is not the owner of the given data’s CRUD operations. Eventual consistency is not easy to maintain in a microservice environment, and in a way it makes the microservice not ‘micro’ anymore if it has to maintain state owned by several other services. Probably making it a macro-service environment 🙂

    1. Hi Kunal.

      The eventual consistency issues are actually one key reason why the single writer principle is important. But at the same time, if you have a service that composes data from several others, it’s hard to perform the required join at runtime. This tends to push you down this route.

      In fact there are a number of issues with the “event propagation” approach in practice. Later in this series we’ll be covering some more advanced mechanisms for doing this as we blend in stream processing. This really changes things.

  7. Hi, nice reading. I don’t fully agree with the chapter Blending Events and Queries, which states:
    “So to add this stock check, the Orders Service would subscribe to the stream of Stock events, storing them locally in a database. It would then query this ‘view’ to validate there is sufficient stock.”

    We really distribute information from one domain to another. I would suggest that in a pure event-driven system I would go with CQRS, so I use event-driven querying as well: no domain data sharing, and no local caching if possible.

    I mainly use consumed events to update local entity status, but not store the actual .

    But probably there is nothing wrong with this kind of local cache: you always have the count available quickly and do not need to ask an external service, so it will definitely be faster. But to me an asynchronous query handles this in a purer way.

    1. Hi Ladislav,

      Yes there are effectively calls for both patterns in practice. One good example is where things are geographically distributed, so you want to move the data so that it’s closer to where you do the query. Another is where you want to retain more control (so you might want to pull data into a database and join it with other sources). But there is no right or wrong here, just different options with different tradeoffs.


  8. Thank you Ben for the great series. A question: in a poor network environment/scenario, what are your suggestions on distributed architecture, especially with the Kafka framework?

    1. Hi David

      It depends a little on what you consider a poor network environment, but the best case I’ve seen is a ship (which is largely without high speed internet access). In this type of environment you replicate between Kafka clusters, actually in much the same way that you replicate between data centres in different geographical locations using Confluent Replicator (or MirrorMaker in Apache Kafka). This replication process can be offline for as long as the source cluster has storage, and then resyncs as and when there is connectivity.

      Currently Kafka Streams and the Producer only support in memory buffering. Another option is to use a local database and Kafka connect. You write to the database so you have local storage on the low-connectivity client then a Kafka Connect daemon to synchronise this database with the central Kafka cluster so you effectively buffer in the database (there is some related discussion here


  9. Great writeup Ben!

    When reading:
    > Here each color represents a topic, in Kafka, for Orders, Shipments and Payments. When a user clicks ‘buy’ it raises Order Requested, waiting for the Order Confirmed event before it responds back to the user.
    I went into solution mode: for the async request and response pattern, how would you implement it? IMO a protocol like WebSocket should do the work. Is that what you would propose too?

    > we should be careful of divergence over time
    Also, can you please elaborate on how we can detect and deal with divergence when using a local store? If the service is unaware of the local store being behind, it could cause big problems that are also hard to debug.

    1. Hi Eric, thanks for the question.

      For your first question, see this other post which comes with an associated code example. It doesn’t use a websocket. You could, but it takes a little more work.

      The question around divergence is a little more subtle. It’s not so much an issue of being behind, it’s more an issue that, over time, data becomes out of sync with the source. An extreme version of this is seen in many large companies where data is moved — often by many different means — from service to service. Over time, as schemas change, the “remote copies” in the various applications and services get corrupted and it can be hard to “reconcile” them back to the “golden source”. There is actually a whole industry built to support this problem (see Master Data Management).
      Using Kafka as “single source of truth” can help with these problems as data can be kept in the broker and a downstream application or service can always replay an entire dataset rather than constantly migrating the schema of the data it holds locally in its database. This works particularly well for smaller datasets: Customers, product catalog etc, but it’s used for larger ones too. See

      The issue of consistency is a little different, but is discussed in the “Building a Microservices Ecosystem…” post above. If the service is based on event streams only, being behind typically isn’t an issue. It can be an issue if you mix synchronous and asynchronous protocols though (e.g. calling the orders service directly + listen to async orders events -> potential for race conditions). Typically you design around this problem.

      1. Thanks Ben for reply.

        I think the key to your code example is the blocking HTTP GET in Orders Service. Websocket approach is more like a push model however it shares the same/similar request registry as yours (`Map outstandingRequests`) so it knows which client to push the response to.

        Thanks for elaborating divergence, I guess I meant to ask about the issue of inconsistency. Your example in _this_ post looks extremely close to Martin Kleppmann’s in “Designing Data Intensive Applications”.

        > For example, say a customer is purchasing an item that is priced in one currency but paid for in another currency. In order to perform the currency conversion, you need to know the current exchange rate. In the dataflow approach, the code that processes purchases would subscribe to a stream of exchange rate updates ahead of time, and record the current rate in a local database whenever it changes. When it comes to processing the purchase, it only needs to query the local database.

        In your post:
        > So to add this stock check, the Orders Service would subscribe to the stream of Stock events, storing them locally in a database. It would then query this ‘view’ to validate there is sufficient stock.

        My question is how do you make sure the local store is up-to-date? I might be missing some key concept here, does KTable ensure that (all I know is KTable is backed by compacted topic, but compaction doesn’t happen immediately once messages are appended so is KTable always the latest view)?

        This might be off the topic a bit, but your reply reminds me of TW calling out an issue in their radar:
        Appreciate any advice so we won’t fall into the same trap again.

        1. Ha, yes, I believe Martin took that example from a talk I did back in 2016 (he references it in that chapter, I think).

          At a high level, when you’re designing this kind of system you’re deliberately avoiding global consistency, which gives you more degrees of freedom to optimise other elements of the system. So in CQRS, which is the pattern this orders service uses, we’re trading timeliness for better performance. So if you choose to read your own writes, your read request may be blocked, but if you have no such requirement you get better write performance because the system doesn’t block writes while the read view is updated.

          In the code example here, the ID is used to tie the two together. So you write an Order with ID=5. If you immediately send a read request for ID=5 and it doesn’t yet exist in the view, the view will block until order 5 is available or the call times out. This assumes orders have unique IDs (or IDs are versioned).
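          A minimal sketch of that blocking read might look as follows (the class and method names here are hypothetical, not the code from the post): the view parks a read for an as-yet-unseen ID and releases it when the event arrives or the timeout fires.

```java
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical sketch: a local view that blocks reads for IDs it hasn't seen yet.
class BlockingOrderView {
    private final Map<String, String> store = new ConcurrentHashMap<>();
    private final Map<String, CompletableFuture<String>> outstanding = new ConcurrentHashMap<>();

    // Called by the event-stream consumer when an order event arrives.
    public void onOrderEvent(String orderId, String order) {
        store.put(orderId, order);
        CompletableFuture<String> waiter = outstanding.remove(orderId);
        if (waiter != null) waiter.complete(order); // release any blocked reader
    }

    // Called by the HTTP GET handler: returns immediately if the order is in
    // the view, otherwise blocks until the event arrives or the timeout expires.
    public String get(String orderId, long timeoutMs) {
        String existing = store.get(orderId);
        if (existing != null) return existing;
        CompletableFuture<String> waiter =
                outstanding.computeIfAbsent(orderId, k -> new CompletableFuture<>());
        // Re-check in case the event raced in between the two lookups.
        String raced = store.get(orderId);
        if (raced != null) return raced;
        try {
            return waiter.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            return null; // timed out: the order never arrived within the window
        }
    }
}
```

          In the real service the timeout would surface as an HTTP error, and the waiter map plays the role of the outstandingRequests registry discussed in this thread.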

          A more general approach requires the caller to pass a logical timestamp to the view to determine if it is up to date. We plan to provide such a feature in the future.

          Regarding the websocket, it requires a distributed mapping in the general case, so it is more complex than the Map outstandingRequests, but conceptually similar. Again, something I hope to formalise in the near future.

          Finally the ESB advice is not off topic at all. I should really write something more formally about this. The whole point of this series is to promote autonomy in the services we build, particularly with respect to data, but also by decoupling interactions.

          By turning the database inside out, you share only the event stream, the simplest possible abstraction for moving and holding data. All the smarts seen in Connect & Streams are embedded in the various services. So that’s to say each team retains control and autonomy and this is the key value gained from this style of implementation.

          What TW are referring to here is the centralised ESB pattern where one central team manages all those concerns, for everyone, so all parts of the inside-out database, which essentially takes you back to many of the disadvantages of a shared database — all teams end up moving in lockstep around the shared ESB with dictated schemas etc. etc.

          So I’m very glad TW chose to emphasise this point, and I should do more to promote this message myself, as I know people will naturally fall into this trap of “overarching centralised infrastructure” which in turn can end up defeating the original point of the endeavour.


          1. Thanks Ben again for taking the time to reply.

            Definitely looking forward to your post regarding the ESB advice. What we have done is take one step further than turning the database inside out: we extract and stream business events on top of CDC, so the events are not necessarily tied to database schemas. The key then is forwards- and backwards-compatible schemas. We are hoping to have more downstream services consume the topics. As you pointed out, we might naturally fall into the trap of centralised infrastructure.

            On the other hand, if we don’t have some centralised “event bus” and instead have each service emit its own events, it definitely decouples the services compared to the REST approach, but it might be hard to manage. We might end up with too many topics (hard to discover) and overlapping events.

            I get why people say the ESB is an anti-pattern: again, as you said, it’s managed by a central team, so that team soon becomes the bottleneck. However, there should be a line between the good (unified streams) and the bad (dependencies/bottlenecks). I feel your diagram of the “Clustered Context Model” may address the problem in some way, but I am unclear how exactly it would.

            Once again, look forward to your recommendation.

          2. “A more general approach requires the caller to pass a logical timestamp to the view to determine if it is up to date. We plan to provide such a feature in the future.

            Regarding the websocket, it requires a distributed mapping in the general case, so it is more complex than the Map outstandingRequests, but conceptually similar. Again something I hope to formalise in the near future.”

            Both are interesting ideas/topics, I was wondering if you have links to any follow up content on them?

  10. Thanks, these posts are blowing my mind!

    But I don’t know if I missed something; I can’t understand the image below.

    Shouldn’t Shipment Prepared be produced by the Shipment Service after it consumes Order Confirmed? The arrow makes me think that Shipment Prepared is consumed by the Shipment Service.

    Also, I’m confused about the two arrows between the Shipment Service and the Shipment Dispatched event: is it both produced and consumed?

  11. Hi,

    About the three mechanisms through which services interact (commands, events, queries):

    How do you handle failures (exceptions) in the event chain of each service? Each of these three entry points can fail (let’s say, due to internal errors). If an error (exception) occurs, how is the entire event chain affected?

    1. Hi Georgi

      There are a few different ways of doing this. If the process is idempotent (which is far easier to deal with), wrap the processing in a try/catch and, if you get an error, set the output message to an error state. Write an admin script or tool that lets you replay failed messages once the error is fixed. Another variant of this is to use a dead-letter topic, which you can use to offload unparsable or unprocessable messages; again, these can be re-inserted back into the main topic once the error is fixed, or sent to an alerting system so you can resubmit, etc.
      If you have a multi-stage process which is not idempotent, then you’ll need to use something like the saga pattern to provide compensating transactions ( I’ve not actually come across anyone who took this route, but I hear that people do.
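      As a rough sketch of the dead-letter variant (the topic names and the process function here are made up for illustration), the consumer loop wraps processing in a try/catch and routes failures to a side topic:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: route each message either to the output topic or,
// on failure, to a dead-letter topic for later replay.
class DeadLetterRouter {
    static final String OUTPUT_TOPIC = "orders-processed"; // assumed name
    static final String DEAD_LETTER_TOPIC = "orders-dlq";  // assumed name

    // Returns a (topic, payload) pair; in a real system this would be a
    // producer.send() call against Kafka.
    static Map.Entry<String, String> route(String message, Function<String, String> process) {
        try {
            return Map.entry(OUTPUT_TOPIC, process.apply(message));
        } catch (Exception e) {
            // Park the poison message; an admin tool can re-insert it into
            // the main topic once the underlying bug is fixed.
            return Map.entry(DEAD_LETTER_TOPIC, message);
        }
    }
}
```

      Replaying is then just a matter of consuming the dead-letter topic and producing back onto the main one.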


  12. Hi Ben,

    Just getting up to speed on this, as we’re planning to use Kafka as a broker in my area.

    I just wondered: where two consumers wanted the same data in two different forms – e.g. FX rates vs USD to be expressed as USD/LCY or LCY/USD – and our data provider gives us only one way, presumably the transform must happen either upstream or downstream of Kafka, and not in Kafka?

    I guess what I’m actually asking is whether enterprise data should be transformed to domain-specific formats by the consumers of that data or whether Kafka has a role there also?



    1. Typically you’d do this kind of transformation in Kafka Streams, either (a) to create a second topic with the inverted FX rate or (b) on read, in each consumer. It depends a bit on who you want to own the logic: the publisher or the subscribers.

      It might be tempting to run a central service to do this. Typically that’s not a great idea if multiple teams are involved (as embedding business logic in shared infrastructure is pretty much where ESBs went wrong), but if it’s just one app it probably doesn’t matter.
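      The per-record logic for option (a) is just inverting the rate; the sketch below shows that function, with a comment indicating where it might sit in a Kafka Streams topology (the topic names are made up for illustration):

```java
// Hypothetical sketch of the inversion behind option (a). In Kafka Streams this
// would sit in a mapValues over the rates topic, along the lines of:
//   builder.stream("fx-rates-usd-lcy").mapValues(FxInverter::invert).to("fx-rates-lcy-usd");
class FxInverter {
    // USD/LCY -> LCY/USD (or vice versa); the rate must be non-zero.
    static double invert(double rate) {
        if (rate == 0.0) throw new IllegalArgumentException("rate must be non-zero");
        return 1.0 / rate;
    }
}
```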

  13. Hi Ben,

    Thanks for this interesting article.

    When designing APIs I always think about the question “who is the provider/contract-owner of the API”?

    Generally my way of thinking is that for notifications the producer (sender) is the provider, because it notifies about something that happened in its area. On the other hand, for commands, my opinion generally is that the processor of the command is the provider.

    When thinking about using events instead of commands, my way of thinking doesn’t fit anymore.

    For the reply event (“orderConfirmed”) it still fits, but for the requesting event (“orderRequested”) it doesn’t, because the contract-owner is the order service but the producer is the basket service. For the payment and shipping services it’s different: they react to events of the order service.

    These different approaches have different impacts. In the “orderRequested” case the producers have to map their internal events to the order event. On the other hand, the payment service has to map the order event to its internal model. So if, next to the order service, additional services were to initiate payment processing, it’s the payment service which has to map those different events to its internal model.

    So my resulting questions are:
    Is it a valid pattern that the producer of an event is not the provider (contract-owner/topic-owner)?
    When to choose which approach?


    1. Events are indeed a very different pattern to commands because they flip responsibility: each service just journals what it does, and other services decide whether they should work in response to the event. In some ways “OrderRequested” is misleading, because it sounds a bit like a command. The intent is simply to record that some event happened: the user requested an order. Both event-driven and command-driven approaches are perfectly valid (the emphasis on events here is really because most people get commands already) and each comes with different tradeoffs around ownership/coupling/pluggability, as you identify. So which to choose, and when, is a very good question. I’d summarise it as:
      The basket service here effectively synthesises a command, but using events. If you only have request-response-style operations, like confirming an order, then using commands is preferable to using events.
      However, where the same events are used to drive asynchronous processing (e.g. shipping or re-pricing, to take the examples in the post), the event-driven model becomes more attractive because it decouples the shipping/repricing processes from the order-creation one. This becomes very important to systems as they grow, particularly microservices, where each deployable unit is relatively small, and even more so if you start looking at FaaS.
      I think much of the popularity of commands arises because we often build systems that start with a UI and hence favour synchronous processing. Later we grow into asynchronous processing. If you know up front this is going to happen, events are a better choice. Conversely, if you think it’s never going to happen, stick with commands.

  14. Hey Ben, let me start by complimenting you on all the posts in the series – amazing stuff! I am trying to use Kafka streams as the only building block of a micro-service event sourcing inspired architecture. I am, however, struggling to find a satisfactory solution to the following problem:

    Say I have a set of gateway services that are all behind a LB. A gateway will receive an HTTP request, perform lightweight validation on it and then send a command to a Kafka topic A. And let’s also say that I have another service running a Kafka Streams app that consumes the aforementioned commands from topic A, processes them, maybe updates a state store etc., and then produces an event in another topic B. If I want the HTTP gateway to be RESTful and be able to read my own writes, I would have to block and produce a response after either a timeout occurs or I receive the event from topic B. So the HTTP gateway would listen for events from topic B. But here comes the rub: since I have multiple such HTTP gateway services, how do I ensure that the gateway service that received the HTTP request and produced a command in topic A would also be the one to receive the corresponding event from topic B? I currently see three (not very good, imo) solutions:
    1) Similar to, use a non-distributed cache to store all the outstanding requests (e.g. as it supports expiry) and make it so HTTP gateways subscribe to topic B from different consumer groups so each gateway receives all the messages (which I don’t quite like)
    2) Use a distributed in-memory cache (e.g. Hazelcast or Redis, which also supports expiry of records) – this way the HTTP gateways don’t need to be in different consumer groups; they can subscribe to different partitions. The downside here is that I would have to support an additional technology in the form of the distributed cache.
    3) Use something like Spring’s Synchronous Kafka:

    I fully realize that what I am trying to achieve is not the main use case of Kafka, which is an asynchronous processing system. However, consider a concrete example of a registration flow where the client (say a browser) submits a RegistrationRequest, the HTTP gateway validates it and produces a RegistrationCommand in a topic, a Kafka Streams app checks against a user state store whether the user already exists, creates it if not, and then produces a UserRegistrationCompleted event (which might be successful or not). I don’t see how this can be implemented asynchronously. Any help would be greatly appreciated (:

    1. Thanks for the question.

      You say “make it so HTTP gateways subscribe to topic B from different consumer groups so each gateway receives all the messages (which I don’t quite like)”

      You shouldn’t have to use individual consumer groups. As long as you key the request by a request ID or correlation ID, the response will come back to the same node that sent it (with all nodes in the same consumer group), so you shouldn’t need a distributed cache; a local one is fine. If there is a rebalance for any reason you could get timeouts, but that’s usually fine.

      Note the use of the KStreams metastore to reroute requests to their key owner:
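      The reason keying makes the routing deterministic can be sketched as follows. This is a simplification: Kafka’s real default partitioner hashes the serialized key with murmur2, and in practice the Streams metadata store is used to forward a request to the node that owns the key; the names here are hypothetical.

```java
// Hypothetical sketch: a command and its reply that share the same correlation
// ID hash to the same partition, and each partition is owned by exactly one
// consumer in the group, so a local outstandingRequests map on that node suffices.
class CorrelationRouting {
    // Deterministic key -> partition mapping (same principle as Kafka's default
    // partitioner, which uses murmur2 on the serialized key).
    static int partitionFor(String correlationId, int numPartitions) {
        return (correlationId.hashCode() & 0x7fffffff) % numPartitions;
    }

    static boolean replyReachesSender(String correlationId, int numPartitions) {
        int commandPartition = partitionFor(correlationId, numPartitions);
        int replyPartition = partitionFor(correlationId, numPartitions);
        // Same key => same partition => same owning consumer instance.
        return commandPartition == replyPartition;
    }
}
```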

  15. Hi Ben,
    Thanks for a very informative article. I had a few comments/doubts.

    In Query by Event-Carried State Transfer, the Orders Service keeps a copy of the stock events published by the Stock Service. For this to work, the Stock Service first has to start publishing all the relevant events as and when the state of stock changes in the Stock Service (application). Is this always doable? Doesn’t this put an extra burden on each service, to not only maintain its own state but also publish all of it to a central event store (Kafka)? How does the Stock Service know/decide what is the relevant information (granularity) that needs to be published?

    In the Single-Writer Principle section, we are saying a useful principle to apply with this style of system is to assign responsibility for propagating events of a specific type to a single service. How does the Order service guarantee that everything the Payment Service needs is available in the Order Validated event? Especially when we say, in the Receiver-Driven Flow Control style, that logic is pushed to the receiver of the events rather than the sender, to flip the burden of responsibility. The event-based service communication diagram hints at all communication happening through events. When a new service wants to participate in this process flow, what is the easiest way for the new service owner to find what types of events are available for consumption (relevant), and which events, when published by this new service, will add further value to the process?


    1. Hi Rahul.

      Thanks for the thoughtful questions.

      > How does Stock service know/decide what is relevant information (granularity) that needs to be published?
      This is a much bigger question and the answer is actually pretty domain-specific, but as a rule of thumb you publish what you receive. The tricky part is that as soon as you publish events you’re coupled to the contract those events provide to downstream systems. This makes it harder to change your internal system.
      In most stable systems this isn’t too bad though (i.e. it gets easier over time as breaking changes become less frequent), but it’s a bit of a dichotomy where there is no perfect solution.

      Also, you often don’t want to publish something that is so raw that it’s not useful. So maybe you have a mortgage application process and the mortgage gets created over ten different steps. You’d have events for each step, but you may not want to publish such fine-grained events downstream because they’re hard to consume. In this case you’d typically create a single aggregated mortgage event at the end of the process.

      This kind of second-stage process is often called an anti-corruption layer because it also insulates the internal data/domain model from the published one. This is sensible, but it’s also a bit more work.

      You also asked about the burden of maintaining your own state and the published state. Yes, this is a greater burden if they are decoupled (and actually also if they are not, but for different reasons). You can use event sourcing / CQRS / write-through, as discussed later in this series ( CQRS is great if your system lends itself to eventual consistency. If not, write-through is typically best. There are also more involved patterns with CDC, like the outbox pattern, which can help ( and If you’re working with a relational database, bi-temporal is a great pattern and is supported natively in most relational databases these days (
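      For reference, a toy sketch of the outbox pattern mentioned above (the maps here are stand-ins for real database tables; in practice the relay would be a CDC connector or a polling publisher):

```java
import java.util.*;

// Hypothetical sketch: the service writes its state change and the
// corresponding event in one local transaction, so state and published
// events cannot diverge; a separate relay drains the outbox to Kafka.
class OutboxExample {
    final Map<String, String> ordersTable = new LinkedHashMap<>(); // stands in for a DB table
    final List<String> outboxTable = new ArrayList<>();            // same DB, same transaction

    // Both writes commit together (synchronized stands in for the transaction).
    synchronized void saveOrder(String orderId, String order) {
        ordersTable.put(orderId, order);
        outboxTable.add("OrderCreated:" + orderId);
    }

    // The relay publishes and clears the pending outbox rows.
    synchronized List<String> drainOutbox() {
        List<String> toPublish = new ArrayList<>(outboxTable);
        outboxTable.clear();
        return toPublish;
    }
}
```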

      >How does Order service guarantee, what is needed (all the information) by Payment Service is available in the Order Validated event?
      This is really a process question, i.e. you need to have a conversation with the owners of the other services. Event Storming is a great way to nail this down early. But the same question must be answered regardless of whether you use events or APIs.

      >When a new service wants to participate in this process flow, what is the easiest way for this new service owner to find what type of events are available for consumption (relevant) and what events when published by this new service will add further value to the process?
      The schema registry provides details of what events are published on what topics. Confluent will be releasing a better UI for this in the future, which includes the ability to search schemas, etc. But the best way to manage schemas today is actually in GitHub. This provides a repo of available event schemas; you can propose changes to those events and have your downstream consumers +1 them. Consumers can also propose PRs for things they need.

      Hope that helps

