I just finished implementing a non-trivial bit of Pijul: cross-process locks for Sanakirja databases. In this post, I explain how it works.
Sanakirja is essentially a copy-on-write collection of B trees in a file. We can start mutable transactions, which allow us to change the database and can either be dropped (after which the database remains unmodified) or committed, and immutable transactions, which only allow us to read the database.
There are a few rules to prevent race conditions: only one mutable transaction can be active at a time, and a mutable transaction cannot start if an immutable transaction started before the last commit is still active. In other words, the following scenarios are allowed. In these drawings, the green mutable transaction starts first, and the blue transaction starts after the commit of the green one. The pink immutable transaction starts before everything else; it can continue even after the commit of the green transaction, and will read the database as it was before the green transaction started.
However, before starting the blue transaction, we must wait for the end of the pink one.
As shown in the following drawing, it does not matter whether the pink transaction was started before or after the beginning of the green one:
Another notable feature of this system is that another immutable transaction (drawn in orange in the below diagram) can even be started between the two mutable transactions, in parallel to the pink transaction. In that case, the orange and pink transactions will read different versions of the database. This is allowed because the orange transaction starts after the commit of the green one.
The size of nodes in the B tree is 4096 bytes, which is the size of memory pages on most platforms (UltraSPARC being an exception, with 8192 bytes pages).
Pages are often considered to synchronise to disk atomically¹, even though larger chunks might only be partially synchronised. This is why it is always safe to choose a smaller block size than the page size of the platform: the larger pages on UltraSPARC will be slower to synchronise, but that synchronisation will still happen atomically, thanks to disk caches.
In Sanakirja, the first page contains pointers to (1) the list of free pages and (2) a collection of at most 500 different pointers to B trees.
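Taking the footnote at the end of this post at face value (24 bytes of headers plus 8 bytes per root table), a quick back-of-the-envelope check shows why 500 root pointers fit comfortably on a single 4096-byte page. The function name is mine, for illustration only:

```rust
// My arithmetic, based on the 24 + 8r formula from the footnote
// at the end of this post (r = number of root tables).
const fn root_page_bytes(r: usize) -> usize {
    24 + 8 * r
}

fn main() {
    // 500 root tables need 24 + 8 * 500 = 4024 bytes: one page is enough.
    assert!(root_page_bytes(500) <= 4096);
    // With a single root table, only 32 bytes need to be atomic.
    assert_eq!(root_page_bytes(1), 32);
}
```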
In a mutable transaction, existing pages are not modified: they are instead copied to a newly allocated page. This sounds horrible in terms of performance, but it is not that bad in practice, since we have to copy these pages to memory anyway before we can access them in the program, and write them back to disk after modifying them, and these synchronous input/output operations are much more expensive than copying page-size blocks, even on SSDs.
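As a rough illustration of this copy-on-write discipline, here is a sketch where the "file" is simply a byte vector held in memory. All names and the allocation scheme are my simplifications, not Sanakirja's actual code:

```rust
// A minimal sketch of copy-on-write page updates (names are mine).
const PAGE_SIZE: usize = 4096;

struct MutTxn {
    file: Vec<u8>,    // the database file, here simply held in memory
    next_free: usize, // offset of the next free page
}

impl MutTxn {
    /// Instead of modifying `page` (a byte offset) in place, copy it
    /// to a freshly allocated page and return the offset of the copy;
    /// the original page is left untouched.
    fn cow(&mut self, page: usize) -> usize {
        let new = self.next_free;
        self.next_free += PAGE_SIZE;
        self.file.resize(self.next_free, 0);
        let (old, fresh) = self.file.split_at_mut(new);
        fresh[..PAGE_SIZE].copy_from_slice(&old[page..page + PAGE_SIZE]);
        new
    }
}

fn main() {
    let mut txn = MutTxn { file: vec![7u8; PAGE_SIZE], next_free: PAGE_SIZE };
    let copy = txn.cow(0);
    // The copy has the same contents, and the original is untouched,
    // so readers of the old version are unaffected.
    assert_eq!(copy, PAGE_SIZE);
    assert!(txn.file[..PAGE_SIZE].iter().all(|&b| b == 7));
    assert!(txn.file[copy..copy + PAGE_SIZE].iter().all(|&b| b == 7));
}
```

Because pages that are still referenced by an older version are never touched, an immutable transaction started before a commit keeps reading a consistent snapshot.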
I wrote before about the lock model of Sanakirja, but that discussion was limited to the scope of a single process and relied on file locks: any process using Sanakirja would first take an exclusive file lock, and then use the correct concurrency model between its own threads.
However, Pijul sometimes needs a bit more than this: for example, @cole-h reported in issue #43 that `pijul record` locked up other commands, which was inconvenient if you wanted to open a log while editing a change. It is of course fundamental to prevent two records, or a record and a pull, from happening in parallel, but a full exclusive lock was clearly too much.
The system needs three atomic variables in order to work: a clock, incremented at each commit; a counter of active immutable transactions; and a counter of immutable transactions that started before the last commit and are still active.

Each immutable transaction remembers the value of the clock at the time it starts, and increments the active transaction counter. When it stops, it decrements the counter of active immutable transactions, and if it started before the last commit, it also decrements the counter of transactions active during the last commit.
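Within a single process, that bookkeeping might be sketched with atomics roughly as follows. All names are mine, and a real implementation would also need to serialise commits against each other; this is only a model of the counters described above:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A sketch of the three variables: a clock, the number of active
// immutable transactions, and how many of those started before the
// last commit (names are mine, not Sanakirja's).
static CLOCK: AtomicU64 = AtomicU64::new(0);
static ACTIVE: AtomicU64 = AtomicU64::new(0);
static ACTIVE_BEFORE_COMMIT: AtomicU64 = AtomicU64::new(0);

struct ImmutableTxn {
    started_at: u64, // clock value when this transaction started
}

fn begin_immutable() -> ImmutableTxn {
    ACTIVE.fetch_add(1, Ordering::SeqCst);
    ImmutableTxn { started_at: CLOCK.load(Ordering::SeqCst) }
}

fn end_immutable(txn: ImmutableTxn) {
    ACTIVE.fetch_sub(1, Ordering::SeqCst);
    if txn.started_at < CLOCK.load(Ordering::SeqCst) {
        // This transaction started before the last commit.
        ACTIVE_BEFORE_COMMIT.fetch_sub(1, Ordering::SeqCst);
    }
}

fn commit() {
    CLOCK.fetch_add(1, Ordering::SeqCst);
    // Every immutable transaction active right now predates this commit.
    ACTIVE_BEFORE_COMMIT.store(ACTIVE.load(Ordering::SeqCst), Ordering::SeqCst);
}

/// A mutable transaction may start only once no immutable transaction
/// from before the last commit is still running.
fn can_start_mutable() -> bool {
    ACTIVE_BEFORE_COMMIT.load(Ordering::SeqCst) == 0
}

fn main() {
    let t1 = begin_immutable();
    let t2 = begin_immutable();
    commit();
    assert!(!can_start_mutable()); // the blue transaction must wait
    end_immutable(t1);
    end_immutable(t2);
    assert!(can_start_mutable()); // now it can start
}
```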
I considered two different systems to implement this in a cross-process way:
Reading and updating all these values at once to a file at each operation (maybe using shared memory), where the file would be protected by a lock (or a named mutex or semaphore). This sounds feasible in principle, but there is a catch: at the end of the last immutable transaction that was started before the last commit, we need a mechanism to signal to a potential mutable transaction that it can start. Between threads, this is achieved with condition variables. Between processes, this is more complicated: we could use Unix signals, but their handlers can interrupt any running function, which means that these handlers cannot take any mutex, and cannot talk to Tokio at all, which limits their usefulness.
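For reference, the thread-level version with a condition variable mentioned above might look like this; it is a simplified model of the signalling, not Sanakirja's actual code:

```rust
use std::sync::{Condvar, Mutex};

// A mutable transaction blocks until the count of immutable
// transactions started before the last commit drops to zero
// (simplified model; names are mine).
struct LockState {
    active_before_commit: Mutex<u64>,
    zero: Condvar,
}

impl LockState {
    fn end_immutable_started_before_commit(&self) {
        let mut n = self.active_before_commit.lock().unwrap();
        *n -= 1;
        if *n == 0 {
            // Wake up a mutable transaction waiting to start.
            self.zero.notify_one();
        }
    }

    fn wait_to_start_mutable(&self) {
        let mut n = self.active_before_commit.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while *n > 0 {
            n = self.zero.wait(n).unwrap();
        }
    }
}

fn main() {
    use std::sync::Arc;
    use std::thread;
    let s = Arc::new(LockState {
        active_before_commit: Mutex::new(2),
        zero: Condvar::new(),
    });
    let s2 = s.clone();
    let h = thread::spawn(move || {
        // Two old immutable transactions finish...
        s2.end_immutable_started_before_commit();
        s2.end_immutable_started_before_commit();
    });
    // ...and only then can the mutable transaction proceed.
    s.wait_to_start_mutable();
    h.join().unwrap();
}
```

It is exactly this wake-up step that has no convenient cross-process equivalent, which motivates the second option below.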
The other option, which I ended up implementing, is to write a barebones implementation of the locking model in an external process, using Tokio tasks to model transactions, and communicate with that Tokio runtime using Unix Domain Sockets. Every time a transaction starts, it tries to connect to a Unix Domain Socket at a specified location. If that fails, a new process is forked to run the hidden `pijul lock` command in the background (using `std::process::Command::spawn`), which opens a Unix Domain Socket and prints a newline to its standard output. The parent process waits for that newline, and then connects to the socket. The `pijul lock` command then works as follows:
locked). All these messages are encoded on one byte.
When its last client disconnects, `pijul lock` waits for one second, checks whether there are other clients, and then exits if there are none. This delay avoids starting instances of `pijul lock` over and over again when using Pijul in a script.
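The client side of this dance (try the socket first, and spawn the server only if nobody is listening) might be sketched like this; the error handling and command invocation are illustrative, not Pijul's actual code:

```rust
use std::io::{BufRead, BufReader};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::process::{Command, Stdio};

// Illustrative sketch: connect to the lock server's socket, spawning
// the server first if no one is listening yet.
fn connect_or_spawn(socket: &Path) -> std::io::Result<UnixStream> {
    match UnixStream::connect(socket) {
        Ok(stream) => Ok(stream),
        Err(_) => {
            // No server yet: spawn `pijul lock` in the background...
            let mut child = Command::new("pijul")
                .args(["lock"])
                .stdout(Stdio::piped())
                .spawn()?;
            // ...wait for it to print a newline, signalling that the
            // socket is open and ready for connections...
            let mut line = String::new();
            BufReader::new(child.stdout.take().unwrap()).read_line(&mut line)?;
            // ...then connect.
            UnixStream::connect(socket)
        }
    }
}
```

Note that two processes racing through the `Err` branch could both try to spawn a server; the real implementation has to tolerate that (only one can bind the socket).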
First, this only works on Unix platforms at the moment, but it seems the equivalent of UDS on Windows might be implemented soon in Tokio. Pijul on Windows still relies on standard file locks.
Moreover, this is implemented in Pijul at the moment, not in Libpijul, because Libpijul is not meant to depend on Tokio in the long run (that dependency will be dropped before the beta of 1.0).
A final thing to note is that file locks are advisory, and so is this external process. This means they are safe to use as long as all processes that write to the database are aware of them. However, as is the case for any file in a filesystem, a process that refuses to cooperate and has write access to the file will always be able to corrupt it.
Thanks to HackerNews user @vlowther for pointing out an oversimplification in an earlier version of this post. Sanakirja relies on the atomicity of $24 + 8r$ bytes, where $r$ is the number of “root” tables. In most cases, this is far less than 4 KiB. When many tables are needed, Sanakirja can represent “tables of tables, indexed for example by an integer” (this has the Rust type `Db<u64, Db<K, V>>`), using a single root table, in which case the atomicity requirement falls to 32 bytes. ↩︎