Chapter 6.
Version Stamps
Many critics of NoSQL databases focus on the lack of support for transactions. Transactions are a useful
tool that helps programmers support consistency. One reason why many NoSQL proponents worry less
about a lack of transactions is that aggregate-oriented NoSQL databases do support atomic updates
within an aggregate—and aggregates are designed so that their data forms a natural unit of update.
That said, it’s true that transactional needs are something to take into account when you decide what
database to use.
As part of this, it’s important to remember that transactions have limitations. Even within a
transactional system we still have to deal with updates that require human intervention and usually
cannot be run within transactions because they would involve holding a transaction open for too long.
We can cope with these using version stamps—which turn out to be handy in other situations as well,
particularly as we move away from the single-server distribution model.
6.1. Business and System Transactions
The need to support update consistency without transactions is actually a common feature of systems
even when they are built on top of transactional databases. When users think about transactions, they
usually mean business transactions. A business transaction may be something like browsing a product
catalog, choosing a bottle of Talisker at a good price, filling in credit card information, and confirming
the order. Yet all of this usually won’t occur within the system transaction provided by the database
because this would mean locking the database elements while the user is trying to find their credit card
and gets called off to lunch by their colleagues.
Usually applications only begin a system transaction at the end of the interaction with the user, so that
the locks are only held for a short period of time. The problem, however, is that calculations and
decisions may have been made based on data that’s changed. The price list may have updated the price
of the Talisker, or someone may have updated the customer’s address, changing the shipping charges.
Some databases provide a similar mechanism of conditional update that allows you to ensure updates
won’t be based on stale data. You can do this check yourself, although you then have to ensure no other
thread can run against the resource between your read and your update. (Sometimes this is called a
compare-and-set (CAS) operation, whose name comes from the CAS operations done in processors. The
difference is that a processor CAS compares a value before setting it, while a database conditional
update compares a version stamp of the value.)
There are various ways you can construct your version stamps. You can use a counter, always
incrementing it when you update the resource. Counters are useful since they make it easy to tell if one
version is more recent than another. On the other hand, they require the server to generate the counter
value, and also need a single master to ensure the counters aren’t duplicated.
Another approach is to create a GUID, a large random number that’s guaranteed to be unique. These
use some combination of dates, hardware information, and whatever other sources of randomness they
can pick up. The nice thing about GUIDs is that they can be generated by anyone and you’ll never get a
duplicate; a disadvantage is that they are large and can’t be compared directly for recentness.
A third approach is to make a hash of the contents of the resource. With a big enough hash key size, a
content hash can be globally unique like a GUID and can also be generated by anyone; the advantage is
that they are deterministic—any node will generate the same content hash for same resource data.
However, like GUIDs they can’t be directly compared for recentness, and they can be lengthy.
A fourth approach is to use the timestamp of the last update. Like counters, they are reasonably short
and can be directly compared for recentness, yet have the advantage of not needing a single master.
Multiple machines can generate timestamps—but to work properly, their clocks have to be kept in sync.
One node with a bad clock can cause all sorts of data corruptions. There’s also a danger that if the
timestamp is too granular you can get duplicates—it’s no good using timestamps of a millisecond
precision if you get many updates per millisecond.
6.2. Version Stamps on Multiple Nodes
The basic version stamp works well when you have a single authoritative source for data, such as a
single server or master-slave replication. In that case the version stamp is controlled by the master. Any
slaves follow the master’s stamps. But this system has to be enhanced in a peer-to-peer distribution
model because there’s no longer a single place to set the version stamps.
If you’re asking two nodes for some data, you run into the chance that they may give you different
answers. If this happens, your reaction may vary depending on the cause of that difference. It may be
that an update has only reached one node but not the other, in which case you can accept the latest
(assuming you can tell which one that is). Alternatively, you may have run into an inconsistent update, in
which case you need to decide how to deal with that. In this situation, a simple GUID or etag won’t
suffice, since these don’t tell you enough about the relationships.
The simplest form of version stamp is a counter. Each time a node updates the data, it increments the
counter and puts the value of the counter into the version stamp. If you have blue and green slave
replicas of a single master, and the blue node answers with a version stamp of 4 and the green node
with 6, you know that the green’s answer is more recent.
In multiple-master cases, we need something fancier. One approach, used by distributed version control
systems, is to ensure that all nodes contain a history of version stamps. That way you can see if the blue
node’s answer is an ancestor of the green’s answer. This would either require the clients to hold onto
version stamp histories, or the server nodes to keep version stamp histories and include them when
asked for data. This also detects an inconsistency, which we would see if we get two version stamps and
neither of them has the other in their histories. Although version control systems keep these kinds of
histories, they aren’t found in NoSQL databases.
A simple but problematic approach is to use timestamps. The main problem here is that it’s usually
difficult to ensure that all the nodes have a consistent notion of time, particularly if updates can happen
rapidly. Should a node’s clock get out of sync, it can cause all sorts of trouble. In addition, you can’t
detect write-write conflicts with timestamps, so it would only work well for the single master case—and
then a counter is usually better.
The most common approach used by peer-to-peer NoSQL systems is a special form of version stamp
which we call a vector stamp. In essence, a vector stamp is a set of counters, one for each node. A vector
stamp for three nodes (blue, green, black) would look something like [blue: 43, green: 54, black: 12].
Each time a node has an internal update, it updates its own counter, so an update in the green node
would change the vector to [blue: 43, green: 55, black: 12]. Whenever two nodes communicate, they
synchronize their vector stamps. There are several variations of exactly how this synchronization is
done. We’re coining the term “vector stamp” as a general term in this book; you’ll also come across
vector clocks and version vectors—these are specific forms of vector stamps that differ in how they
synchronize.
By using this scheme you can tell if one version stamp is newer than another because the newer stamp
will have all its counters greater than or equal to those in the older stamp. So [blue: 1, green: 2, black: 5]
is newer than [blue:1, green: 1, black 5] since one of its counters is greater. If both stamps have a
counter greater than the other, e.g. [blue: 1, green: 2, black: 5] and [blue: 2, green: 1, black: 5], then you
have a write-write conflict. There may be missing values in the vector, in which case we use treat the
missing value as 0. So [blue: 6, black: 2] would be treated as [blue: 6, green: 0, black: 2]. This allows you
to easily add new nodes without invalidating the existing vector stamps.
6.3. Key Points
• Version stamps help you detect concurrency conflicts. When you read data, then update it, you can
check the version stamp to ensure nobody updated the data between your read and write.
• Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a
combination of these.
• With distributed systems, a vector of version stamps allows you to detect when different nodes have
conflicting updates.