[Webinar] Deliver enterprise-grade Apache Kafka® to your customers | Join Now
This is an edited and expanded transcript of a talk I gave at Strange Loop 2014. The video recording (embedded below) has been watched over 8,000 times. For those of you who prefer reading, I thought it would be worth writing down the talk.
Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?
A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts. You can query it at some point in time — but that’s still old, imperative style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real-time.
This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but it’s more than that. Really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out.
At its core is a distributed, durable commit log, implemented by Apache Kafka. Layered on top are simple but powerful tools for joining streams and managing large amounts of data reliably.
What do we have to gain from turning the database inside out? Simpler code, better scalability, better robustness, lower latency, and more flexibility for doing interesting things with data. After this talk, you’ll see the architecture of your own applications in a new light.
This talk is about database architecture and application architecture. It’s somewhat related to an open source project I’ve been working on, called Apache Samza. I’m Martin Kleppmann, and I was until recently at LinkedIn working on Samza. At the moment I’m taking a sabbatical to write a book for O’Reilly, called Designing Data-Intensive Applications.
Let’s talk about databases. What I mean is not any particular brand of database — I don’t mind whether you’re using relational, or NoSQL, or whatever. I’m really talking about the general concept of a database, as we use it when building applications.
Take, for example, the stereotypical web application architecture:
You have a client, which may be a web browser or a mobile app, and that client talks to some kind of server-side system (which you may call a “backend” or whatever you like). The backend typically implements some kind of business logic, performs access control, accepts input, produces output. When the backend needs to remember something for the future, it stores that data in a database, and when it needs to look something up, it queries a database. That’s all very familiar stuff.
The way we typically build these sorts of applications is that we make the backend layer stateless. That has a lot of advantages: you can scale out the backend by just running more processes in parallel, and you can route any request to any backend instance (they are all equally well qualified to handle the request), so it’s easy to spread the load across multiple machines. Any state that is required to handle a request will be looked up from the database on each request. That also works nicely with HTTP, since HTTP is a stateless protocol.
However, the big problem with this approach is: the state has to go somewhere, and so we have to put it in the database. We are now using the database as a kind of gigantic, global, shared, mutable state. It’s like a global variable that’s shared between all your application servers. It’s exactly the kind of horrendous thing that, in shared-memory concurrency, we’ve been trying to get rid of for ages. Actors, channels, goroutines, etc. are all attempts to get away from shared-memory concurrency, avoiding the problems of locking, deadlock, concurrent modifications, race conditions, and so on.
We’re trying to get away from shared-memory concurrency, but with databases we’re still stuck with this big, shared, mutable state. So it’s worth thinking about this: if we’re trying to get rid of shared memory in our single-process application architecture, what would happen if we tried to get rid of this shared mutable state on a whole-system level?At the moment, it seems to me that the main reason why systems are still being built with mutable databases is just inertia: that’s the way we’ve building applications for decades, and we don’t really have good tools to do it differently. So, let’s think about what other possibilities we have for building stateful systems.
In order to try to figure out what routes we could take, I’d like to look at four different examples of things that databases currently do, and things that we do with databases. And these four examples might give us an indicator of the directions in which we could take these systems forward in future.
The first example I’d like to look at is replication. You probably know about the basics of replication: the idea is that you have a copy of the same data on multiple machines (nodes), so that you can serve reads in parallel, and so that the system keeps running if you lose a machine.
It’s the database’s job to keep those replicas in sync. A common architecture for replication is that you send your writes to one designated node (which you may call the leader, master or primary), and it’s the leader’s responsibility to ensure that the writes are copied to the other nodes (which you may call followers, slaves or standbys). There are also other ways of doing it, but leader-based replication is familiar — many systems are built that way.
Let’s look at an example of replication to see what’s actually happening under the hood. Take a shopping cart, for instance.
This is using a relational data model, but the same principles apply with other data models too. Say you have a table with three columns: customers, products, and quantity. Each row indicates that a particular customer has a particular quantity of a particular product in their shopping cart.
Now say customer 123 changes their mind, and instead of wanting quantity 1 of product 999, they actually want quantity 3 of that product. So they issue an update query to the database, which matches the row for customer 123 and product 999, and it changes the value of the quantity column from 1 to 3.
The result is that the database overwrites the quantity value with the new value, i.e. it applies the update in the appropriate place.
Now, I was talking about replication. What does this update do in the context of replication? Well, first of all, you send this update query to your leader, and it executes the query, figures out which rows match the condition, and applies the write locally:
Now, how does this write get applied to the other replicas? There are several different ways how you can implement replication. One option is to send the same update query to the follower, and it executes the same statement on its own copy of the database. Another option is to ship the write-ahead log from the leader to the follower.
A third option for replication, which I’ll focus on here, is called a logical log. In this case, the leader writes out the effect that the query had — i.e. which rows were inserted, updated or deleted — like a kind of diff. For an update, like in this example, the logical log identifies the row that was changed (using a primary key or some kind of internal tuple identifier), gives the new value of that row, and perhaps also the old value.
This might seem like nothing special, but notice that something interesting has happened here:
At the top we have the update statement, an imperative statement describing the state mutation. It is an instruction to the database, telling it to modify certain rows in the database that match certain conditions.
On the other hand, when the write is replicated from the leader to the follower as part of the logical log, it takes a different form: it becomes an event, stating that at a particular point in time, a particular customer changed the quantity of a particular product in their cart from 1 to 3. And this is a fact — even if the customer later removes the item from their cart, or changes the quantity again, or goes away and never comes back, that doesn’t change the fact that this state change occurred. The fact always remains true.
This distinction between an imperative modification and an immutable fact is something you may have seen in the context of event sourcing. That’s a method of database design that says you should structure all of your data as immutable facts, and it’s an interesting idea.
However, what I’m saying here is: even if you use your database in the traditional way, overwriting old state with new state, the database’s internal replication mechanism may still be translating those imperative statements into a stream of immutable events.
Hold that thought for now: I’m going to talk about some completely different things, and return to this idea later.The second one of the four things I want to talk about is secondary indexing. You’re probably familiar with secondary indexes — they are the bread and butter of relational databases. Using the shopping cart example again:
You have a table with different columns, and you may have several different indexes on that table in order to be able to efficiently find rows that match a particular query. For example, you may run some SQL to create two indexes: one on the customer_id column, and a separate index on the product_id column.
Using the index on customer_id you can then efficiently find all the items that a particular customer has in their cart. Using the index on product_id you can efficiently find all the carts that contain a particular product.
What does the database do when you run one of these CREATE INDEX queries?
The database scans over the entire table, and it creates an auxiliary data structure for each index. An index is a data structure that represents the information in the base table in some different way. In this case, the index is a key-value-like structure: the keys are the contents of the column that you’re indexing, and the values are the rows that contain this particular key.
Put another way: to build the index for the customer_id column, the database takes all the values that appear in that column, and uses them as keys in a dictionary. A value points at all of the occurrences of that value — for example, the index entry 123 points at all of the rows which have a customer_id of 123. Similarly for the other index.
The important point here is that the process of going from the base table to the indexes is completely mechanical. You simply tell the database that you want a particular index to exist, and it goes away and builds that index for you.The index doesn’t add any new information to the database — it just represents the existing data in a different structure. (Put another way, if you drop the index, that doesn’t delete any data from your database.) It’s a redundant data structure that only exists to make certain queries faster. And that data structure can be entirely derived from the original table.
Creating an index is essentially a transformation which takes a database table as input, and produces an index as output. The transformation consists of going through all the rows in the table, picking out the field that you want to index, and restructuring the data so that you can look up by that field. That transformation is built into the database, so you don’t need to implement it yourself. You just tell the database that you want an index on a particular field to exist, and it does all the work of building it.
Another great thing about indexes: whenever the data in the underlying table changes, the database automatically updates the indexes to be consistent with the new data in the table. In other words, this transformation function which derives the index from the original table is not just applied once when you create the index, but applied continuously.
With many databases, these index updates are even done in a transactionally consistent way. This means that any later transactions will see the data in the index in the same state as it is in the underlying table. If a transaction aborts and rolls back, the index modifications are also rolled back. That’s a really great feature which we often don’t appreciate!What’s even better is that some databases let you build an index at the same time as continuing to process write queries. In PostgreSQL, for example, you can say CREATE INDEX CONCURRENTLY. On a large table, creating an index could take several hours, and on a production database you wouldn’t want to have to stop writing to the table while the index is being built. The index builder really needs to be a background process which can run while your application is simultaneously reading and writing to the database as usual.
The fact that databases can do this is quite impressive. After all, to build an index, the database has to scan the entire table contents, but those contents are changing at the same time as the scan is happening. The index builder is tracking a moving target. At the end, the database ends up with a transactionally consistent index, despite the fact that the data was changing concurrently.
In order to do this, the database needs to build the index from a consistent snapshot at one point in time, and also keep track of all the changes that occurred since that snapshot while the index build was in progress. That’s a really cool feature.
So far we’ve discussed two aspects of databases: replication and secondary indexing. Let’s move on to number 3: caching.What I’m talking about here is caching that is explicitly done by the application. (You also get caching happening automatically at various levels, such as the operating system’s page cache and the CPU caches, but that’s not what I’m talking about here.)
Say you have a website that becomes popular, and it becomes too expensive or too slow to hit the database for every web request, so you introduce a caching layer — often using memcached or Redis or something of that sort. And often this cache is managed in application code, which typically looks something like this:
When a request arrives at the application, you first look in a cache to see whether the data you want is already there. The cache lookup is typically by some key that describes the data you want. If the data is in the cache, you can return it straight to the client.
If the data you want isn’t in the cache, that’s a cache miss. You then go to the underlying database, and query the data that you want. On the way out, the application also writes that data to the cache, so that it’s there for the next request that needs it. The thing it writes to the cache is whatever the application would have wanted to see there in the first place. Then the application returns the data to the client.
This is a very common pattern, but there are several big problems with it.
The first problem is that clichéd quote about there being only two hard problems in computer science (which I can’t stand any more). But seriously, if you’re managing a cache like this, then cache invalidation really is tricky. When data in the underlying database changes, how do you know what entries in the cache to expire or update? One option is to have an expiry algorithm which figures out which database change affects which cache entries, but those algorithms are brittle and error-prone. Alternatively, you can just have a time-to-live (expiry time) and accept that you sometimes read stale data from the cache, but such staleness is often unacceptable.
Another problem is that this architecture is very prone to race conditions. For example, say you have two processes concurrently writing to the database and also updating the cache. They might update the database in one order, and the cache in the other order, and now the two are inconsistent. Or if you fill the cache on read, you may read and write concurrently, and so the cache is updated with a stale value while the concurrent write is occurring. I suspect that most of us building these systems just pretend that the race conditions don’t exist, because they are just too much to think about.
A third problem is cold start. If you reboot your memcached servers and they lose all their cached contents, suddenly every request is a cache miss, the database is overloaded because of the sudden surge in requests, and you’re in a world of pain. If you want to create a new cache, you need some way of bootstrapping its contents without overloading other parts of the system.
So, here we have a contrast: on the one hand, creating a secondary index in a database is beautifully simple, one line of SQL — the database handles it automatically, keeping everything up-to-date, and even making the index transactionally consistent. On the other hand, application-level cache maintenance is a complete mess of complicated invalidation logic, race conditions and operational problems.
Why should it be that way? Secondary indexes and caches are not fundamentally different. We said earlier that a secondary index is just a redundant data structure on the side, which structures the same data in a different way, in order to speed up read queries. A cache is just the same.
If you think about it, a cache is also the result of taking your data in one form (the form in which it’s stored in the database) and transforming it into a different form for faster reads. In other words, the contents of the cache are derived from the contents of the database.
We said that a secondary index is built by picking out one field from every record, and using that as the key in a dictionary. In the case of a cache, we may apply an arbitrary function to the data: the data from the database may have gone through some kind of business logic or rendering before it’s put in the cache, and it may be the result of joining several records from different tables. But the end result is similar: if you lose your cache, you can rebuild it from the underlying database; thus, the contents of the cache are derived from the database.
In a read-through cache, this transformation happens on the fly, when there is a cache miss. But we could perhaps imagine making the process of building and updating a cache more systematic, and more similar to secondary indexes. Let’s return to that idea later.
I said I was going to talk about four different aspects of database. Let’s move on to the fourth: materialized views.
You may already know w