Save 25% or More on Your Kafka Costs | Take the Confluent Cost Savings Challenge
Microservices are all the rage these days. Passionate, thoughtful advocates and detractors present compelling arguments for and against the architectural style. Usually, these arguments boil down to whether organizations should adopt, refrain from, or abandon microservices architecture based on an assessment of the costs and benefits of the style for the size and technical culture maturity of those organizations.
However, such an adopt-or-abandon premise overlooks an important dimension of the microservices conversation: what style of microservices would map most closely to the needs of the organization? Perhaps considered in this light, the costs that would deter certain organizations might be mitigated and the benefits retained by a better style of microservices.
Perhaps the most common shared understanding of microservices is represented by the Death Star diagram, which describes a network of services and their synchronous, run-time dependencies on other services. Much of the microservices mindshare—writing, diagramming and designing, libraries and tooling—assumes this model of small services calling each other synchronously, usually via HTTP. And it is usually this model that is the subject of the adopt-or-abandon assessments described above. This conflating of the underlying values and goals of microservices with a specific model for their implementation is unfortunate.
AWS Death Star diagram, circa 2008 as per Werner Vogels tweet
This view of microservices shares much in common with object-oriented programming: encapsulated data access and mutable state change are both achieved via synchronous calls, the web of such calls among services forming a graph of dependencies. Programmers can and should enjoy a lively debate about OO’s merits and drawbacks for organizing code within a single memory and process space. However, when the object-oriented analogy is extended to distributed systems, many problems arise: latency which grows with the depth of the dependency graph, temporal liveness coupling, cascading failures, complex and inconsistent read-time orchestration, data storage proliferation and fragmentation, and extreme difficulty in reasoning about the state of the system at any point in time.
Luckily, another programming style analogy better fits the distributed case: functional programming. Functional programming describes behavior not in terms of in-place mutation of objects, but in terms of the immutable input and output values of pure functions. Such functions may be organized to create a dataflow graph such that when the computation pipeline receives a new input value, all downstream intermediate and final values are reactively computed. The introduction of such input values into this reactive dataflow pipeline forms a logical clock that we can use to reason consistently about the state of the system as of a particular input event, especially if the sequence of input, intermediate, and output values is stored on a durable, immutable log.
(Of course, the application of these single-memory-space programming styles as analogies to distributed architectures is not a rigorous attempt to extend their underlying formal models, but is rather an informal extension of their basic intuition and principles.)
What would a microservices architecture following this functional programming analogy look like? One possibility is the Commander Pattern, of which Capital One has provided an open-source reference implementation:
In this model, an action (in the form of a command from outside or event from inside the bounded context) is written down in its raw, denormalized, business-domain form to an immutable log. Various topologies of functions (dataflows) react to process this new event, eventually incorporating its various data-domain consequences into aggregations representing current state. Clients of the service (and the various processors themselves) perceive/query the current system state via these aggregations.
Though, as observed above, much of the microservices mind-share is devoted to the OO-analogy synchronous style, there are some battle-hardened tools and techniques for implementing the FP-analogy style.
In the model above, action (state-changing writes) flows through a separate system path from perception (non-state-changing reads), which is an example of Command Query Responsibility Segregation (CQRS).
System state is derived from the log of these actions, which is an example of Event Sourcing.
These two architectural techniques from Domain Driven Design (DDD) often go hand-in-hand, and happen to fit well within this functional programming view of microservices.
In this FP architecture, many services perceive and react to new values on the log, because log consumption is ubiquitous and cheap. However, writing to the log must be handled much more carefully, as Ben Stopford observed in the section “The Single Writer Principle” of Build Services on a Backbone of Events. Other log-centric, CQRS systems that I’ve worked on, such as the database Datomic also follow this principle.
Within this architecture, just as functions within a functional program, different microservices in the topology will serve different purposes. Most of the processing nodes in the topology will act as “pure” functional transforms, i.e. accepting new inputs from some set of input logs, doing some math according to the business rules (possibly involving a local aggregation for stateful computations like joins, windowed aggregates, etc.), and emitting new values onto some set of output logs. These form the “Functional Core” of the system.
The other two types of microservices act as the “Imperative Shell” of the system, executing certain side-effects. The first category of these are shared aggregations within the bounded context. These services build useful “materialized views” of the current state of the system, such as full-text indexes, entity-oriented REST services, attributed oriented BI/analysis services, analytics dashboard views, etc.
The second type of these “Imperative Shell” services execute side-effects outside of the bounded context, such as sending an email, SMS, or mobile push notification to a user, or calling out to some other external service. These side-effects must be managed carefully, accounting for target service downtime, retries, idempotency, and the storage of the results of such calls as applicable. The record of external actions and the corresponding results may be used to implement Sagas or distributed rollback/reversal.
Kafka has exactly the characteristics needed to form the backbone of such a FP-analogy microservices architecture. It both durably stores and reactively conveys immutable values, making it the ideal place to record input, intermediate, and output values for functional dataflow pipelines.
Kafka’s Streams API provides powerful primitives for implementing precisely such distributed computational pipelines. Kafka Streams provides a high-level API for building topologies of transforms (e.g. map and filter), local and shared aggregations (e.g. Global Tables and Stores API, and side-effect processors like foreach, peek, and even map for storing results). Via the aggregations described above, Kafka Streams also seamlessly unifies stream processing with interactive queries, potentially making each processing microservice also a read-only view of its current state via e.g. REST or RPC.
Kafka’s Connect framework also provides a simple and powerful way of building shared aggregations in external systems like RDBMSes, full-text indexes, document-oriented stores, key-value stores, graph databases, column-oriented or star-schema analytics databases, or whatever most closely matches the desired data-access patterns of your services.
Let’s explore the ideas above using a real-life (though somewhat simplified) example from a banking use-case at Capital One, created by our colleague Qiang Xue (much of this section is adapted from his summary, with his permission and our thanks).
Our customer wants to be able to set spending limits on particular debit or credit card numbers for their account (e.g. in order to rein in Junior’s spending at college). In order to authorize a financial transaction on that card number (e.g. a debit or credit card swipe), the authorizing process must maintain a running balance of the spending on that card number.
Note: The code examples below target version 0.11 of Kafka’s Streams API. Since this article was first written, the Streams API changed slightly in the latest 1.0 release.
For illustrative purposes, let’s assume that the transactions are accessible through a Kafka topic. We can write our main stream processing code in Java:
In fact, the actual implementation of this example is much more complicated:
But even when adding in these complications, the service boundaries, topics, and code stay relatively straightforward (the full code listing was even included in a draft of this post!).
Both the transform and aggregation service types are obvious in the example above, but we can also easily imagine a side-effect service which observed calculated spending balances, and notifies customers via mobile push or SMS when the preset spending limit is reached. In line with the microservices value of coordination avoidance, such a notification service could be added later without any knowledge or involvement from the team that implemented the business logic.
The functional style of microservices enabled by Kafka and its Streams API have helped this service to meet its goals in production:
Several other applications from various product lines are being migrated to this functional streaming style.
Finally, let’s compare and contrast the FP-style of microservices described above with the ubiquitous OO-style along several important dimensions: data access patterns, orchestration patterns, handling failure, write-time and read-time latency, reasoning about state over time, and dealing with side-effects.
In the OO-style of microservices, the service boundaries are usually drawn such that each service encapsulates a relatively small data domain (sometimes just a single type of entity). This fragmentation has some negative consequences:
On the other hand, the FP-style we’ve presented:
The OO-style of microservices which encapsulate data require lots of read-time orchestration in order to put together aggregate queries across services.
Client asks the Accounts_API a question, so
Accounts_API turns around and asks the
Transactions_µS and the
The reactive FP-style does all of this orchestration at write-time. As new facts are recorded in Kafka, the various topologies eagerly do their respective processing.
The orchestration pattern described above has another shortcoming: availability in the face of failure. The synchronously communicating services are temporally coupled to each others’ runtime liveness, i.e. the
Balances_µS has to be alive at the time the
Accounts_API calls it in order for the
Accounts_API to be fully available. The problem compounds exponentially as the call graph deepens. Techniques like circuit-breakers, etc. may be employed to paper-over this dependency, but the underlying problem remains.
This shortcoming is even worse at write-time, since the write-orchestrating service’s sequence of actions is likely not atomic, i.e. Accounts_API might die or encounter an unavailable target service mid-way through a complex orchestration of writes to its various dependencies, leaving the system in an inconsistent state or subject to complex and error-prone roll-back logic.
In the FP-style, and especially since transactional writes were introduced in Kafka 0.11, writes are simple, atomic (either all fail or all succeed), and carefully managed (single writer principle described above). Writes to Kafka may be tuned for a desired consistency level, and do depend on Kafka’s availability, which is likely to be much better than the availability of any given microservice.
Reads are not coupled to the availability of upstream writers, or even to that of Kafka, since reads are served from a locally managed, read-optimized, materialized view. Reads from any given service can proceed for an indefinite amount of time, and will always be consistent as of some point in time, with clients understanding that that point in time might lag behind the latest upstream event (and clients can decide if that’s good enough). If this AP tradeoff is unacceptable for a service’s use-case, that service can choose to return errors or become unavailable when it falls behind. Aggregation microservices control their own availability and consistency trade-offs.
In the sequence diagrams above, the OO-style is obviously much more latent at read time than the FP example, since it has to call through a two-layer deep graph of latency:
Client->Accounts_API + max(Accounts_API->Transactions_µS, Accounts_API->Balances_µS). Why wait until a client asks a question before computing the answer? Why incur so many hops of HTTP latency?
In the FP-style, writes are very fast (since it’s just writing down a command/event to Kafka), then there is some intervening processing latency while this new fact is incorporated into the system, and then reads are very fast because they’re served out of a purpose-built, read-only materialized view.
The OO-style orchestration model described above also introduces great difficulty in maintaining consistency and in reasoning about the state of the system over time.
At write-time, commands are constantly coming into the various services and are being recorded locally without any whole-system sequencing of these commands. This makes validating consistency a burdensome and error-prone task during write-time orchestration (e.g. two-phase commit between separate services, proper sequencing and error recovery, unknown atomicity in face of mid-process service failure, etc.)
At read time, a new transaction may arrive between the call
Accounts_API->Transactions_µS and Accounts_API->Balances_µS such that the transaction is present in the
Transactions_µS response, but the impact on balance hasn’t landed in the
Balances_µS. The response from
Accounts_API would then be inconsistent.
In the FP-style, however, there is an explicit notion of logical time in the ordering of the events in the log, which creates both a clear provenance of derived facts tracing back to source facts, and the ability to reason about the state of the system as of any particular source event. As of version 0.11, Kafka and its Streams API support exactly-once processing semantics, ensuring that the processing pipelines work properly under failure.
Reads in this FP-style will always be consistent as of some valid point in logical time, but any given view might not yet have incorporated every available event.
In the OO-style, each call from one service to another is possibly a side-effect, depending on the safety semantics of the particular call. Service orchestration is the process of sequencing these side-effects with local business rule processing to produce the desired end state. This pervasive intermingling of local processing with side-effecting calls to other services makes it very difficult to roll back and prevent duplicate calls during failures.
In the FP-style, as discussed above in “Three categories of microservices”, the pure functional transforms and local aggregations that comprise the “functional core” of the system have no side-effects (i.e. other than writing to local storage and writing to Kafka which are carefully managed by Kafka Streams to have the proper “functional” semantics). Side-effects to targets outside the bounded context (like the notification service invocation cited above) are executed carefully by purpose-built services in the “imperative shell” of the system when triggered by an upstream immutable value added to the log.
Before either rejecting microservices architecture as too complicated, or implementing it blindly according to the grooves cut by early adopters and popular stacks of tooling, consider what type of system will meet your needs, and whether the reactive, immutable, functional style of microservices enabled by Kafka and Kafka Streams might be a better fit.
Learn more about best practices and technical patterns of Capital One’s deployment of Apache Kafka in Capital One Delivers Risk Insights in Real Time with Stream Processing, an online talk featuring Ravi Dubey, Senior Manager, Software Engineering, and Jeff Sharpe, Senior Software Engineer.
Companies are looking to optimize cloud and tech spend, and being incredibly thoughtful about which priorities get assigned precious engineering and operations resources. “Build vs. Buy” is being taken seriously again. And if we’re honest, this probably makes sense. There is a lot to optimize.
Operating Kafka at scale can consume your cloud spend and engineering time. And operating everyday tasks like scaling or deploying new clusters can be complex and require dedicated engineers. This post focuses on how Confluent Cloud is 1) Resource Efficient, 2) Fully Managed, and 3) Complete.