Towards a hybrid snapshot/patch version control system

Friday, January 7, 2022
By Pierre-√Čtienne Meunier

Where I outline the design of tags in Pijul, which are a first step towards a more general hybrid snapshot/patch version control system. The Sanakirja tricks presented here can also be applied outside the context of Pijul, to create efficient compressed databases.

About Sanakirja

Sanakirja is a general key-value store designed specifically to implement efficient database forks (which Pijul uses for channels). Not all features of a fully general key-values store are implemented at the moment: in particular, large values are still missing. If you are interested in using Sanakirja for a specific project, and need extra features, feel free to reach out.

Compressing B-trees

The main topic of this post is a trick I wanted to share to compress Sanakirja databases. This trick uses the “seekable format” of Zstd, which compresses files by blocks of some constant size instead of compressing the entire file in one go. The main advantage is that one can decompress only a small part of the file without having to decompress everything.

In the example below, we build a “backend” for Sanakirja using a compressed database file. The full source of this is available on the Nest.

The first step is to start a sanakirja transaction, and add a few values.


const N: usize = 1000;

fn add_values(env: &Env<sanakirja::Exclusive>) -> Vec<u64> {
    let mut txn = sanakirja::Env::mut_txn_begin(env).unwrap();
    let mut db: sanakirja::btree::Db<u64, u64> = sanakirja::btree::create_db(&mut txn).unwrap();
    let mut rng = rand::thread_rng();
    let mut values = Vec::with_capacity(N);
    for i in 0..N {
        let v = rng.gen();
        sanakirja::btree::put(&mut txn, &mut db, i as u64, v).unwrap();
        values.push(v);
    }
    txn.set_root(0, db);
    txn.commit().unwrap();
    values
}

This is fairly straightforward: we start a mutable transaction, put in a few random values (using the random number generator in the call to txn.put). Finally, we set the root number 0 to our table db, and commit.

Note that this interface is safe from a Rust point of view, but not really safe from a database point of view: indeed, if you forget to store db either in a “root” (there can be at most 500 roots), or in another table, you will leak pages. This isn’t unsafe per se, but there is nothing in Sanakirja to reclaim these pages later. Even worse, if this isn’t the first transaction, and db is meant to be stored as a root, the former value of the root must be updated, as the value stored in the file still points to the previous table, which has been freed. This can corrupt the database.

All these things are handled in Pijul by using a cache of opened databases, and updating them automatically when committing a transaction.

Back to our example, the next function we write is a way to read the sanakirja file as bytes, and compress it by blocks of BLOCK_SIZE=4096 using Zstd-seekable.

It turns out that Sanakirja databases are split into smaller chunks, where each chunk is implemented either as an mmapped file, or as plain allocated memory. Moreover, the last chunk may not be entirely used. What we do here is, we iterate over these chunks, and compress them until the last page used by the database. In this basic implementation, we will assume, when decompressing, that the size of the database is a multiple of the block size, so we check this here.


const BLOCK_SIZE: usize = 4096;

fn compress_env<T>(env: &sanakirja::Env<T>, file_name: &str) {
    // Find the total length used by Sanakirja
    let mut total = *env.first_unused_page.lock().unwrap();
    assert_eq!(total % (BLOCK_SIZE as u64), 0);

    // Initialise a Zstd compressor
    let mut comp = zstd_seekable::SeekableCStream::new(19, BLOCK_SIZE).unwrap();
    let mut out = Vec::new();
    let mut f = std::fs::File::create(file_name).unwrap();

    // Then, for each chunk in the Sanakirja database, compress and write it.
    for i in env.mmaps.lock().unwrap().iter() {
        let l = i.length().min(total);
        out.resize(l as usize, 0u8);
        let input = i.as_slice();
        let mut input_off = 0;
        let mut output_off = 0;
        while input_off < l as usize {
            let (a, b) = comp.compress(
                &mut out[output_off..],
                &input[input_off..],
            ).unwrap();
            input_off += b;
            output_off += a;
        }
        f.write_all(&out[..output_off]).unwrap();
        total -= l;
    }

    // Finally, write the final Zstd block.
    if out.is_empty() {
        out.resize(4096, 0);
    }
    let l = comp.end_stream(&mut out).unwrap();
    f.write_all(&out[..l]).unwrap();
}

We are now ready to define a Sanakirja backend, which is an implementation of the sanakirja::LoadPage trait. Note that this backend will not allow us to write to the database, but at least we’ll be able to read very efficiently.

We first define the type of that backend, it only needs two fields:

We’re using mutexes for both, since the load_page function below takes a &self, meaning that we need interior mutability anyway, and this allows us to share the transaction between multiple threads (since it is read-only anyway).

struct ZTxn<'a, R> {
    s: Mutex<zstd_seekable::Seekable<'static, WithOffset<std::fs::File>>>,
    loaded: Mutex<HashMap<u64, Box<[u8; crate::tag::BLOCK_SIZE]>>>,
}

We need to implement two methods for sanakirja::LoadPage: the first one, load_page, loads a page into the main memory and returns a pointer to it, together with the offset to that page in the file. Here, this is done by returning either a pointer to an existing block from the HashMap, or by inserting one in the HashMap, and returning a pointer to that (this also explains why we need two RefCell instead of one):

impl ::sanakirja::LoadPage for ZTxn {
    type Error = zstd_seekable::Error;
    fn load_page(&self, off: u64) -> Result<::sanakirja::CowPage, Self::Error> {
        use ::sanakirja::CowPage;
        let off_aligned = (off / crate::tag::BLOCK_SIZE as u64) * crate::tag::BLOCK_SIZE as u64;
        let mut l = self.loaded.lock().unwrap();
        let p = if let Some(p) = l.get_mut(&off_aligned) {
            unsafe { p.as_mut_ptr().add((off - off_aligned) as usize) }
        } else {
            let mut buf = Box::new([0; crate::tag::BLOCK_SIZE]);
            self.s
                .lock()
                .unwrap()
                .decompress(&mut buf[..], off_aligned)?;
            let p = unsafe { buf.as_mut_ptr().add((off - off_aligned) as usize) };
            l.insert(off_aligned, buf);
            p
        };
        Ok(CowPage {
            data: p,
            offset: off,
        })
    }
}

Note that the off parameter to load_page is an offset to a page, which implies that it is a multiple of 4096. Since BLOCK_SIZE may be chosen larger than that (but must still be a multiple of 4096, in this example), in which case some extra arithmetic is needed to decompress the right block and find the page. This isn’t entirely necessary though, we could as well use the zstd-seekable API to retrieve a single page. However, if all the pages in the block are read at least once, this would decompress the entire block BLOCK_SIZE/4096 times.

The LoadPage trait is used internally in Sanakirja, and isn’t very useful in itself. It is used to implement, among other things, txn.get and iterators. To use high-level methods, we need to implement the following trait (which has no required methods):

impl<'a, R> sanakirja::Transaction for ZTxn<'a, R> {}

Putting it all together, we create an in-memory Sanakirja environment (of size 100 pages), add our values, compress it, and read it back with a Sanakirja iterator:


fn main() {
    let env = Env::new_anon(409600).unwrap();
    let values = add_values(&env);
    let file_name = "compressed.zst";
    compress_env(&env, file_name);
    let txn = ZTxn {
        s: Mutex::new(zstd_seekable::Seekable::init_file(file_name).unwrap()),
        loaded: Mutex::new(HashMap::new())
    };
    let db: sanakirja::Db<u64, u64> = txn.root(0).unwrap();
    for (k, v) in txn.iter(&db, None) {
        assert_eq!(values[k as usize], v);
    }
}

Conclusion

I ran a few examples with a block size of 16 pages, and the results are presented in the following two plots. For each size shown in the plots, I ran the experiment ten times (except for 10 million keys, where I ran it five times), and displayed each point as the mean, plus error bars representing the full range of times and sizes.

The first plot represents the reading time in log-log scale, which shows that the time for iterating through the entire database seem asymptotically linear in the number of keys in the table. The gap between the two curves corresponds to a factor of ~4, meaning that it takes roughly four times as long for the compressed version than for the uncompressed one.

The next plot shows the raw and compressed sizes of the database, this time in a log-linear scale, which highlights the fact that the raw database takes a space linear in its number of keys, while the compressed database grows very slowly.

The following image shows the same plot on a log-log scale again. We can see that both grow somewhat linearly (which is expected, since that we’re feeding it with random input), but the growth of the compressed database is much below the $y = x$ line, in fact for ten million keys, the compressed database is $10^4$ times smaller than the uncompressed one. Also for 10 million keys, the compressed database takes less than 0.002 byte per key/value pair!