Sanakirja gets its full concurrency model

Wednesday, March 20, 2019

I just fixed a few remaining bugs in Sanakirja, the database backend behind Pijul, and took the opportunity to update its concurrency model.

The design

Sanakirja implements transactional operations on B-trees stored on disk. Its distinctive feature is to allow fast clones of a database (in O(log n), where n is the size of the database).

I’ll review in this post how we achieve this.

B-trees

B-trees are an amazing data structure. They are trees whose nodes are “blocks” (usually memory pages, which are the same size as disk blocks). Insertions always go into a block, and a block is split when a new insertion would make it larger than a page.

Moreover, insertion only happens at the leaves, which means that the depth of the tree can only increase by splitting the root. This is how B-trees stay balanced: no separate rebalancing operation is needed.
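To make the splitting rule concrete, here is a minimal sketch of inserting into a single leaf and splitting it when it overflows. The `Leaf` type, the `CAPACITY` constant and the `insert` function are hypothetical, and the sketch measures a page in number of keys rather than in bytes; it is not Sanakirja's code.

```rust
/// Hypothetical page capacity, counted in keys rather than bytes.
const CAPACITY: usize = 4;

/// A toy leaf: a sorted list of keys standing in for one page.
#[derive(Debug, PartialEq)]
struct Leaf {
    keys: Vec<u64>,
}

/// Insert `key` into `leaf`, keeping the keys sorted. If the leaf
/// overflows, split it in two and return the separator key together
/// with the new right sibling, so that the caller can insert both
/// into the parent (which may overflow and split in turn).
fn insert(leaf: &mut Leaf, key: u64) -> Option<(u64, Leaf)> {
    match leaf.keys.binary_search(&key) {
        Ok(_) => return None, // key already present, nothing to do
        Err(pos) => leaf.keys.insert(pos, key),
    }
    if leaf.keys.len() <= CAPACITY {
        return None;
    }
    // Overfull: the median key moves up as the separator, and the
    // keys above it move to a new page.
    let mid = leaf.keys.len() / 2;
    let mut right_keys = leaf.keys.split_off(mid);
    let separator = right_keys.remove(0);
    Some((separator, Leaf { keys: right_keys }))
}
```

When the returned separator makes the parent overfull as well, the same function applies one level up, and the tree only grows in depth when this reaches the root.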

Deletions are somewhat trickier to implement, since the key we want to delete might be at an internal node.

Copy-on-write

In our case, the key to transactionality is a copy-on-write strategy: each time we want to write something, we clone the page first, and modify just the copy. I know this sounds like a costly operation, but it’s actually not that costly, since most hard disks load data one full page at a time anyway, and also write it one full page at a time. There are other ways of implementing transactions, but I won’t discuss them here: we have a very good reason for choosing copy-on-write, and that is fast clones (see below).
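The idea can be sketched in a few lines. The `Store` type and its `cow_update` and `commit` methods are hypothetical names for this illustration, and the sketch assumes that committing amounts to switching to a new root page; it says nothing about how Sanakirja lays out pages on disk.

```rust
/// A toy page store: pages are byte vectors, addressed by index.
struct Store {
    pages: Vec<Vec<u8>>,
    /// Index of the page currently considered the root.
    root: usize,
}

impl Store {
    /// Copy page `page`, apply `f` to the copy only, and return the
    /// index of the copy. The original page is left untouched, so
    /// anyone still reading it sees the old contents.
    fn cow_update(&mut self, page: usize, f: impl FnOnce(&mut Vec<u8>)) -> usize {
        let mut copy = self.pages[page].clone();
        f(&mut copy);
        self.pages.push(copy);
        self.pages.len() - 1
    }

    /// In this sketch, committing is just switching the root.
    /// Cancelling a transaction means never calling this.
    fn commit(&mut self, new_root: usize) {
        self.root = new_root;
    }
}
```

Until `commit` is called, the old root still describes a complete and consistent version of the data, which is what makes cancelling a transaction cheap.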

This strategy, however, is not without its problems. The main issue is that in B-trees, a change in a leaf might propagate upwards and cause the root to split when the root becomes overfull, or to disappear if we manage to merge all its children. Therefore, if we copy a page to update it, and later merge that same page with one of its siblings in the tree, the first copy would have been useless.

The solution to this in Sanakirja is laziness: instead of changing anything in the page, we just remember that we should do it after we take care of the next level. In the source code of the del method, the type that stores the last operation is the enum Deleted<K, V>.

More design: fast clones

In Sanakirja, fast clones are done by keeping reference counts of all the pages. However, this could potentially break transactionality: if we cancel a transaction, how would we revert all changes in reference counts at the same time?

Our solution is to use an extra B-tree mapping page numbers to reference counts. In order to avoid circularity (since B-trees are implemented using reference counting), we store only reference counts that are at least 2, and never clone the reference-counting B-tree.
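Here is a minimal sketch of the "only store counts of at least 2" convention, using the standard library's `BTreeMap` as a stand-in for the on-disk B-tree. The `RefCounts` type and its methods are hypothetical names chosen for this illustration.

```rust
use std::collections::BTreeMap;

/// Toy reference-count table: only counts >= 2 are stored, so a
/// page that is absent from the map has exactly one reference.
#[derive(Default)]
struct RefCounts {
    counts: BTreeMap<u64, u64>,
}

impl RefCounts {
    /// Current number of references to `page`.
    fn get(&self, page: u64) -> u64 {
        self.counts.get(&page).copied().unwrap_or(1)
    }

    /// Add one reference to `page`.
    fn incr(&mut self, page: u64) {
        *self.counts.entry(page).or_insert(1) += 1;
    }

    /// Drop one reference to `page`. Returns `true` if that was the
    /// last reference, meaning the page can now be freed.
    fn decr(&mut self, page: u64) -> bool {
        match self.counts.get(&page).copied() {
            None => true, // implicit count of 1: last reference gone
            Some(2) => {
                // Back to a single reference: stop storing the count.
                self.counts.remove(&page);
                false
            }
            Some(n) => {
                self.counts.insert(page, n - 1);
                false
            }
        }
    }
}
```

With this convention, a database that has never been cloned has an empty table, and the pages of the table itself never need an entry since that tree is never cloned.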

The recent bug

As strange as it may seem, the del method might sometimes cause some pages to split, increasing the total number of nodes in the tree. Moreover, a split can “cascade”, meaning that the key between the two newly created pages causes the parent to also split.

In the bug we discovered recently, we were freeing the page containing the separator key after the split had been updated. However, if the cascade happens with the same key, or in other words if the separator for the first split is also used as the next separator, this is wrong: we should instead wait until the cascade is over before freeing any page containing keys.

The concurrency model

In Sanakirja 0.10, writers (also called “mutable transactions”) exclude each other, but do not exclude readers (or “immutable transactions”). Until now, we had a stricter concurrency model, implemented using read-write locks.

As explained in the docs, in the new concurrency model, a mutable transaction waits before starting, until all readers that started before the latest commit have finished.

Edit (23/4/2019)

In the sanakirja::transaction module, we add a number of variables to “environments”:

use std::sync::{Condvar, Mutex, MutexGuard};
pub struct Env {
    // ... (other fields elided)
    /// The clock is incremented every time a Txn starts, and every
    /// time a MutTxn ends.
    clock: Mutex<u64>,
    /// Every time we commit, we count the number of active Txns.
    txn_counter: Mutex<usize>,
    /// Last commit date (according to clock) + number of active Txns
    /// at the time of the last commit. At the end of a Txn started
    /// before the last commit date, decrement the counter.
    last_commit_date: Mutex<(u64, usize)>,
    concurrent_txns_are_finished: Condvar,

    /// Ensure only one mutable transaction can be started.
    mutable: Mutex<()>,
}

The clock variable gives each immutable transaction a start “date”, which can then be compared to the “date” of the last commit. The txn_counter variable counts the number of active immutable transactions: whenever we start an immutable transaction, we increment both the clock and the immutable transaction counter:

    /// Start a read-only transaction.
    pub fn txn_begin<'env>(&'env self) -> Result<Txn<'env>, Error> {
        let mut read = self.clock.lock()?;
        *read += 1;
        let mut counter = self.txn_counter.lock()?;
        *counter += 1;
        Ok(Txn {
            env: self,
            start_date: *read,
        })
    }

I also added a variable last_commit_date, containing two pieces of information: one is the date of the last commit (a u64), and the other one is the number of transactions that were active at the time of the last commit.

Finally, when we drop an immutable transaction, we decrement txn_counter. Moreover, if we are dropping the last transaction that started before the last commit, we signal the condition variable concurrent_txns_are_finished:

impl<'env> Drop for Txn<'env> {
    fn drop(&mut self) {
        let mut m = self.env.txn_counter.lock().unwrap();
        *m -= 1;
        let mut m = self.env.last_commit_date.lock().unwrap();
        if self.start_date <= m.0 {
            m.1 -= 1
        }
        if m.1 == 0 {
            self.env.concurrent_txns_are_finished.notify_one()
        }
    }
}

And every time we start a mutable transaction, we wait for the end of all immutable transactions that were started before the last commit:

    pub fn mut_txn_begin<'env>(&'env self) -> Result<MutTxn<'env, ()>, Error> {
        let guard = self.mutable.lock()?;

        // Wait until all transactions that were started before
        // the start of the last mutable transaction are finished.
        let mut last_commit = self.last_commit_date.lock()?;
        while last_commit.1 > 0 {
            last_commit = self.concurrent_txns_are_finished.wait(last_commit)?;
        }
        // ... (rest of the function elided)
    }

Here is the relevant section of commit:

impl<'env> Commit for MutTxn<'env, ()> {
    fn commit(mut self) -> Result<(), Error> {
        // ... (earlier part of commit elided)
        let mut last_commit = self.env.last_commit_date.lock()?;
        let n_txns = self.env.txn_counter.lock()?;
        let mut clock = self.env.clock.lock()?;
        *clock += 1;
        last_commit.0 = *clock;
        last_commit.1 = *n_txns;
        // ... (rest of commit elided)
    }
}

Note that unlike the other new variables, where the Mutex could probably be replaced by atomic operations, the race condition on last_commit_date is slightly more serious: indeed, if an immutable transaction could be dropped at the same time as a commit, then the next call to mut_txn_begin would keep waiting for that immutable transaction to finish.
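To experiment with this protocol outside of Sanakirja, here is a self-contained toy model of it. It is a simplification under stated assumptions rather than the actual implementation: all counters live behind a single mutex (the hypothetical `State` struct) instead of the separate mutexes shown above, errors are unwrapped instead of returned, and the transactions carry no data. The field names `active_readers` and `readers_at_last_commit` are made up for this sketch.

```rust
use std::sync::{Condvar, Mutex, MutexGuard};

/// All counters behind one lock (a simplification for this sketch).
#[derive(Default)]
struct State {
    clock: u64,
    active_readers: usize,
    last_commit_date: u64,
    /// Readers that were active at the last commit and are still alive.
    readers_at_last_commit: usize,
}

#[derive(Default)]
struct Env {
    state: Mutex<State>,
    concurrent_txns_are_finished: Condvar,
    /// Ensures only one mutable transaction exists at a time.
    mutable: Mutex<()>,
}

struct Txn<'env> {
    env: &'env Env,
    start_date: u64,
}

struct MutTxn<'env> {
    env: &'env Env,
    _guard: MutexGuard<'env, ()>,
}

impl Env {
    /// Start a read-only transaction: tick the clock, count the reader.
    fn txn_begin(&self) -> Txn<'_> {
        let mut s = self.state.lock().unwrap();
        s.clock += 1;
        s.active_readers += 1;
        Txn { env: self, start_date: s.clock }
    }

    /// Start a mutable transaction: exclude other writers, then wait
    /// for readers that were already active at the last commit.
    fn mut_txn_begin(&self) -> MutTxn<'_> {
        let guard = self.mutable.lock().unwrap();
        let mut s = self.state.lock().unwrap();
        while s.readers_at_last_commit > 0 {
            s = self.concurrent_txns_are_finished.wait(s).unwrap();
        }
        MutTxn { env: self, _guard: guard }
    }
}

impl<'env> MutTxn<'env> {
    /// Record the commit date and how many readers are still active.
    fn commit(self) {
        let mut s = self.env.state.lock().unwrap();
        s.clock += 1;
        s.last_commit_date = s.clock;
        s.readers_at_last_commit = s.active_readers;
    }
}

impl<'env> Drop for Txn<'env> {
    fn drop(&mut self) {
        let mut s = self.env.state.lock().unwrap();
        s.active_readers -= 1;
        // Only readers started before the last commit are being waited for.
        if self.start_date <= s.last_commit_date {
            s.readers_at_last_commit -= 1;
            if s.readers_at_last_commit == 0 {
                self.env.concurrent_txns_are_finished.notify_one();
            }
        }
    }
}
```

Because the drop of a reader and the commit both take the same lock in this model, a reader is either counted by the commit and decrements the counter later, or is already gone when the commit counts the active readers.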

More news about Pijul

We’ve improved Pijul a lot since the last release (0.11, in November 2018). In particular, the new diff algorithms mean that Pijul is now a lot faster when recording. This impacts most operations, since Pijul automatically creates a temporary patch containing the unrecorded changes in the working copy before applying other patches to the repository.

Moreover, we’re working on making signing keys easier to use, thanks to the amazing work done by the Sequoia team on implementing PGP in Rust.