Compressed Sanakirja databases

By Pierre-Étienne Meunier

About Sanakirja

Sanakirja is a general key-value store designed specifically to implement efficient database forks (which Pijul uses for channels). Not all features of a fully general key-value store are implemented at the moment: in particular, large values are still missing. If you are interested in using Sanakirja for a specific project and need extra features, feel free to reach out.

Compressing B-trees

The main topic of this post is a trick I wanted to share to compress Sanakirja databases. I recently needed to write Rust bindings to the “seekable format” of Zstd, because this is how changes are stored in Pijul. That format compresses files by blocks of some constant size instead of compressing them in one go. The main advantage is that one can decompress only a small part of the file without having to decompress it all.
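To fix ideas, here is a minimal sketch of the reading side of those bindings, using only the calls that appear later in this post (the file name and offset are made up for the example):

fn read_one_page() {
    // Open a file compressed with the seekable format; "example.zst" is
    // just a placeholder name for this sketch.
    let mut s = zstd_seekable::Seekable::init_file("example.zst").unwrap();
    // Decompress 4096 bytes starting at uncompressed offset 8192, without
    // having to decompress the rest of the file.
    let mut buf = [0u8; 4096];
    s.decompress(&mut buf, 8192).unwrap();
}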

In the example below, we build a “backend” for Sanakirja using a compressed database file. The full source of this is available on the Nest.

The first step is to start a Sanakirja transaction and add a few values.


const N: usize = 1000;

fn add_values(env: &Env<sanakirja::Exclusive>) -> Vec<u64> {
    use rand::Rng;
    // Start a mutable transaction and create a fresh table.
    let mut txn = sanakirja::Env::mut_txn_begin(env).unwrap();
    let mut db: sanakirja::Db<u64, u64> = txn.create_db().unwrap();
    let mut rng = rand::thread_rng();
    let mut values = Vec::with_capacity(N);
    for i in 0..N {
        let v = rng.gen();
        txn.put(&mut rng, &mut db, i as u64, v).unwrap();
        values.push(v);
    }
    // Remember the table in root 0, then commit.
    txn.set_root(0, db);
    txn.commit().unwrap();
    values
}

This is fairly straightforward: we start a mutable transaction, and put in a few random values, passing the random number generator to txn.put (which needs randomness, if you remember what I wrote above about skip lists). Finally, we set root number 0 to our table db, and commit.

Note that this interface is safe from a Rust point of view, but not really safe from a database point of view: if you forget to store db either in a “root” (there can be at most 500 roots) or in another table, you will leak pages. This isn’t unsafe per se, but there is nothing in Sanakirja to reclaim those pages later. Even worse, if this isn’t the first transaction and db is meant to be stored as a root, that root must be updated, as it otherwise still points to the previous version of the table. Since pages are reused after they’re freed, potentially in other tables, this can really corrupt the entire database file.
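To make this concrete, here is a hypothetical sketch of what updating a root in a later transaction might look like. The root method on a mutable transaction is an assumption of mine (only set_root, put and commit appear in this post); the point is the final set_root:

fn update_values(env: &Env<sanakirja::Exclusive>) {
    use rand::Rng;
    let mut txn = sanakirja::Env::mut_txn_begin(env).unwrap();
    // Re-open the table previously stored in root 0 (assumed method, see above).
    let mut db: sanakirja::Db<u64, u64> = txn.root(0).unwrap();
    let mut rng = rand::thread_rng();
    let v: u64 = rng.gen();
    txn.put(&mut rng, &mut db, N as u64, v).unwrap();
    // The put may have copied the table's pages, so root 0 must be updated
    // before committing; otherwise it would keep pointing to freed pages.
    txn.set_root(0, db);
    txn.commit().unwrap();
}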

All these things are handled in Pijul by using a cache of opened databases, and updating them automatically when committing a transaction.

Back to our example, the next function we write reads the Sanakirja file as bytes and compresses it by blocks of BLOCK_SIZE = 4096 bytes using Zstd-seekable.

It turns out that Sanakirja databases are split into smaller chunks, implemented as either mmapped files or plain allocated memory, and the last chunk may not be entirely used. What we do here is iterate over these chunks and compress them up to the last page used by the database. In this basic implementation, we will assume, when decompressing, that the size of the database is a multiple of the block size, so we check that here.


const BLOCK_SIZE: usize = 4096;

fn compress_env<T>(env: &sanakirja::Env<T>, file_name: &str) {
    use std::io::Write;
    // Find the total length used by Sanakirja
    let mut total = *env.first_unused_page.lock().unwrap();
    assert_eq!(total % (BLOCK_SIZE as u64), 0);

    // Initialise a Zstd compressor
    let mut comp = zstd_seekable::SeekableCStream::new(19, BLOCK_SIZE).unwrap();
    let mut out = Vec::new();
    let mut f = std::fs::File::create(file_name).unwrap();

    // Then, for each chunk in the Sanakirja database, compress and write it.
    for i in env.mmaps.lock().unwrap().iter() {
        let l = i.length().min(total);
        out.resize(l as usize, 0u8);
        let input = i.as_slice();
        let mut input_off = 0;
        let mut output_off = 0;
        while input_off < l as usize {
            let (a, b) = comp.compress(
                &mut out[output_off..],
                &input[input_off..],
            ).unwrap();
            input_off += b;
            output_off += a;
        }
        f.write_all(&out[..output_off]).unwrap();
        total -= l;
    }

    // Finally, write the final Zstd block.
    if out.is_empty() {
        out.resize(4096, 0);
    }
    let l = comp.end_stream(&mut out).unwrap();
    f.write_all(&out[..l]).unwrap();
}

We are now ready to define a Sanakirja backend, which is an implementation of the sanakirja::LoadPage trait. Note that this backend will not allow us to write to the database, but at least we’ll be able to read very efficiently.

We first define the type of that backend; it only needs two fields:

struct ZTxn<'a, R> {
    // The seekable decompressor reading the compressed database file.
    s: RefCell<zstd_seekable::Seekable<'a, R>>,
    // Blocks already decompressed, indexed by their offset in the file.
    loaded: RefCell<HashMap<u64, [u8; BLOCK_SIZE]>>,
}

We need to implement two methods of sanakirja::LoadPage. The first one, load_page, loads a page into main memory and returns a pointer to it, together with the offset of that page in the file. Here, this is done by returning either a pointer to an existing block in the HashMap, or by inserting a freshly decompressed block into the HashMap and returning a pointer to that (this also explains why we need two RefCells instead of one):

impl<'a, R> sanakirja::LoadPage for ZTxn<'a, R> {
    fn load_page(&self, off: u64) -> sanakirja::Page {
        use sanakirja::Page;
        use std::collections::hash_map::Entry;
        // Find the start of the compressed block containing page `off`.
        let off_aligned = (off / BLOCK_SIZE as u64) * BLOCK_SIZE as u64;
        let p = match self.loaded.borrow_mut().entry(off_aligned) {
            Entry::Occupied(e) => unsafe {
                e.get().as_ptr().offset((off - off_aligned) as isize)
            },
            Entry::Vacant(e) => {
                let mut buf = [0; BLOCK_SIZE];
                self.s.borrow_mut().decompress(&mut buf, off_aligned).unwrap();
                unsafe {
                    e.insert(buf).as_ptr().offset((off - off_aligned) as isize)
                }
            }
        };
        Page {
            data: p,
            offset: off,
        }
    }

Note that the off parameter to load_page is the offset of a page, so it is a multiple of 4096. Since BLOCK_SIZE may be chosen larger than that (but must still be a multiple of 4096 in this example), some arithmetic is needed to decompress the right block and find the page within it. This isn’t strictly necessary, though: we could just as well use the zstd-seekable API to retrieve a single page. However, if all the pages in a block are read at least once, that would decompress the entire block BLOCK_SIZE/4096 times.
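For instance, with blocks of 16 pages (the size used in the benchmarks below), the arithmetic in load_page maps page offsets to blocks as in this small self-contained check:

#[test]
fn block_arithmetic() {
    // Blocks of 16 pages, i.e. a block size of 65536 bytes.
    let block = 16 * 4096u64;
    // Page number 18, i.e. byte offset 73728 in the uncompressed file.
    let off = 18 * 4096u64;
    let off_aligned = (off / block) * block;
    assert_eq!(off_aligned, 65536); // the page lives in the second block
    assert_eq!(off - off_aligned, 8192); // two pages into that block
}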

The second method relies on the internal data representation of a Sanakirja database: the roots are all stored in the first page as little-endian u64s, after an initial preamble of size sanakirja::ZERO_HEADER. In the following implementation, we load the first page (page 0) using the load_page method, and then read the num-th little-endian 64-bit integer after the preamble. Since the size of the preamble is guaranteed to be a multiple of 8 (i.e. 64-bit-aligned), we can read it with a simple dereference. Then, u64::from_le converts the little-endian integer to the platform’s endianness.

    fn root_(&self, num: usize) -> u64 {
        let page = self.load_page(0);
        unsafe {
            u64::from_le(
                *(page.data
                  .offset(sanakirja::ZERO_HEADER) as *const u64)
                    .offset(num as isize)
            )
        }
    }
}

The LoadPage trait is used internally in Sanakirja, and isn’t very useful in itself. It is used to implement, among other things, txn.get and iterators. To use high-level methods, we need to implement the following trait (which has no required methods):

impl<'a, R> sanakirja::Transaction for ZTxn<'a, R> {}

Putting it all together, we create an in-memory Sanakirja environment (of size 100 pages), add our values, compress it, and read it back with a Sanakirja iterator:


fn main() {
    // An anonymous (in-memory) environment of 100 pages, i.e. 409600 bytes.
    let env = Env::new_anon(409600).unwrap();
    let values = add_values(&env);
    let file_name = "compressed.zst";
    compress_env(&env, file_name);
    let txn = ZTxn {
        s: RefCell::new(zstd_seekable::Seekable::init_file(file_name).unwrap()),
        loaded: RefCell::new(HashMap::new())
    };
    // Read the table back from root 0 and check every value.
    let db: sanakirja::Db<u64, u64> = txn.root(0).unwrap();
    for (k, v) in txn.iter(&db, None) {
        assert_eq!(values[k as usize], v);
    }
}

Conclusion

I ran a few examples with a block size of 16 pages; the results are presented in the plots below. For each size shown, I ran the experiment ten times (except for 10 million keys, where I ran it five times), and each point shows the mean, with error bars spanning the full range of times and sizes.

The first plot shows the reading time on a log-log scale: the time for iterating through the entire database seems asymptotically linear in the number of keys in the table. The gap between the two curves corresponds to a factor of about 4, meaning that iterating takes roughly four times as long for the compressed version as for the uncompressed one.

The next plot shows the raw and compressed sizes of the database, this time on a log-linear scale, which highlights the fact that the raw database takes space linear in its number of keys, while the compressed database grows very slowly.

The following image shows the same plot on a log-log scale. We can see that both grow somewhat linearly (which is expected, since we’re feeding it random input), but the growth of the compressed database stays far below the $y = x$ line: for ten million keys, the compressed database is $10^4$ times smaller than the uncompressed one. Also, for ten million keys, the compressed database takes less than 0.002 bytes per key/value pair!