Rethinking Sanakirja

Friday, February 5, 2021
By Pierre-Étienne Meunier

Sanakirja is a key-value store based on B trees, written in Rust. In this post, I reflect on some design and programming methodology mistakes I’ve made. Benchmarks show that the latest version beats LMDB, the fastest C equivalent, by 20%-50% on a use case similar to my most common load (storing graphs in Pijul).

My last post about Sanakirja sparked a few really constructive discussions, and made me realise that people still cared about the problem of on-disk key-value stores, as unfancy as that problem may sound. This post looks back on some design mistakes I’ve made when I wrote it, and includes benchmarks showing it’s now faster than the fastest equivalent C library.

Also, as some readers have noted, “sanakirja” means “dictionary” in Finnish.

Why?

A long time ago, Pijul was using LMDB as its backend, which came with a number of fundamental limitations.

Why a redesign?

There were just too many reasons: performance first, then the overly complicated concurrency model, the suboptimal algorithmic design, and the suboptimal software design. I’ll explain these in reverse order (so skip ahead if you only want to see my benchmarks).

Personal anecdote

Sanakirja was the first Rust project I completed, and yet it required all the worst memory tricks I’ve ever needed in Rust. In hindsight, I needed that crate, so it was right to write it, but this wasn’t the right way to learn Rust at all. I also knew very little about disk performance, and my goal at the time was really writing Pijul, not profiling databases. And as the benchmarks show, it wasn’t terribly slow either.

#[no_std]

That’s right, this new version works without the standard library. It’s fairly restricted in that case, since there’s no mmap. However, the allocator is generic, meaning that it could in principle be used to allocate different things, like hard drive sectors in a filesystem.
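
To make that genericity concrete, here is a minimal sketch of what such an allocator interface could look like; the trait and names are my own, for illustration, not Sanakirja’s actual API.

```rust
/// A page handed out by the allocator: an offset into the underlying
/// storage (a file, an mmap'ed region, or raw disk sectors).
pub struct Page {
    pub offset: u64,
}

/// Anything that can hand out and reclaim fixed-size pages. An
/// mmap'ed file is one implementation; in a #[no_std] setting, the
/// same trait could allocate hard drive sectors directly.
pub trait AllocPage {
    /// Allocate one page, or fail (for instance when storage is full).
    fn alloc_page(&mut self) -> Result<Page, ()>;
    /// Mark a page as free, making it available to future allocations.
    fn free_page(&mut self, page: Page);
}

/// A trivial bump allocator, for illustration only.
pub struct Bump {
    next: u64,
    page_size: u64,
}

impl AllocPage for Bump {
    fn alloc_page(&mut self) -> Result<Page, ()> {
        let page = Page { offset: self.next };
        self.next += self.page_size;
        Ok(page)
    }
    fn free_page(&mut self, _page: Page) {
        // A real allocator would recycle the page; a bump allocator
        // just forgets about it.
    }
}
```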

Software engineering

Implementing B trees isn’t particularly hard in theory, but a few practical details make it non-trivial, such as keys of varying sizes (these can even force the tree to grow in depth in order to stay balanced and keep its long-term performance guarantees!). Another complication is that the main cost in an on-disk datastructure is accessing a sector that isn’t in RAM, whether for reading or for writing, and since keys and values may be large, we need to avoid copying or moving things around. Yet another nontrivial feature is that we want operations to be transactional, and we want readers to access the database concurrently with a writer (in which case readers see the database as it was before the current writer started).
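
One standard way to get that last property is copy-on-write: a writer never modifies a page a reader might still be traversing, but copies it and modifies the copy instead. A minimal sketch of the idea (my illustration, not Sanakirja’s actual code):

```rust
const PAGE_SIZE: usize = 4096;

struct Pages {
    data: Vec<u8>, // the whole file, as consecutive PAGE_SIZE pages
}

impl Pages {
    /// Instead of mutating page `old` in place (a reader may still be
    /// traversing it), allocate a fresh page holding a copy of it.
    /// The writer then modifies the copy and re-points the parent
    /// node; the old page is freed once no reader can see it.
    fn cow(&mut self, old: usize) -> usize {
        let new = self.data.len() / PAGE_SIZE;
        let copy = self.data[old * PAGE_SIZE..(old + 1) * PAGE_SIZE].to_vec();
        self.data.extend_from_slice(&copy);
        new
    }
}
```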

One issue is that in a first implementation, it isn’t easy to tell what belongs to the “B tree level” (which handles the high-level operations and does the “page management”, meaning allocating and freeing pages) and what belongs to the underlying “node level” (which deals with the byte representation of data). Before you know it, the two levels are mixed and make assumptions about one another. This makes it hard to see what matters performance-wise, and what doesn’t. After looking at my code again recently, I’ve tried to formalise the lessons I’ve learned since my first steps in Rust back in 2015.
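
To illustrate the separation I’m aiming for (a sketch with hypothetical names, not Sanakirja’s actual traits), the B tree level would only ever see something like:

```rust
/// What the B tree level needs from a node, and nothing more: the
/// byte representation of nodes stays hidden behind this trait.
pub trait NodePage<K, V> {
    /// Search for `key` inside this single node.
    fn lookup(&self, key: &K) -> Option<&V>;
    /// Try to insert in place. `None` means the node is full and the
    /// B tree level must split it: splitting allocates pages, which
    /// is "page management" and hence belongs to the level above.
    fn insert(&mut self, key: K, value: V) -> Option<()>;
}
```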

Node datastructure

In an attempt to improve read and write performance, the nodes of my B tree were using a mini-skiplist. While that wasn’t fundamentally slow, it was still slower, at the size of a page, than naive solutions like binary search over a sorted array, which is itself slower than linear search over a (potentially unsorted) array. I remember implementing skiplists because I liked the idea, and then thought of them as a modular brick I could replace later to improve performance as needed. However, I failed to implement them in a modular way (or rather, I initially succeeded, but then debugging happened), and they ended up contaminating the rest of my code, making them hard to replace.

In the new version, I’m still keeping the entries of my internal nodes sorted because I want to traverse them in order, but they’re now just plain sorted arrays. I don’t have benchmarks about that particular issue, but keeping the arrays sorted allows a search to stop as soon as it finds an element larger than the searched key, whereas unsorted arrays force the algorithm to scan each node entirely.
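
As a toy illustration of that difference (not Sanakirja’s actual code):

```rust
/// Linear search in a sorted node: on a miss, we can stop as soon as
/// we see an element larger than `key`.
fn find_sorted(entries: &[(u64, u64)], key: u64) -> Option<u64> {
    for &(k, v) in entries {
        if k == key {
            return Some(v);
        } else if k > key {
            // Everything from here on is larger: stop early.
            return None;
        }
    }
    None
}

/// Linear search in an unsorted node: a miss scans the entire node.
fn find_unsorted(entries: &[(u64, u64)], key: u64) -> Option<u64> {
    entries.iter().find(|&&(k, _)| k == key).map(|&(_, v)| v)
}
```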

Concurrency

First, I wasn’t happy with the concurrency model, especially cross-process and cross-platform. The hack I described in my previous post turned out not to be as reliable as I had wished (as often in the last three months, @tankf33der’s careful tests spotted some problems). That hack was designed to work around a flaw in my original design, where all transactions read the same “zero” page to get a reference to the current version of the database, and that page needed to be protected. This is fixed now: instead, I’m using many pages (each with its own filesystem lock). There is a “current” version, and each mutable transaction increments the current version (modulo the maximum number of versions).

This means that Sanakirja is no longer restricted to starting the next mutable transaction only after all the readers that started before the previous commit have ended. Instead, readers can run at most a constant number of commits behind the current version, a number set when the database is created.
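
Leaving the filesystem locks out, the version arithmetic I’m describing boils down to something like this sketch (illustration only, not the actual implementation):

```rust
const N_VERSIONS: usize = 8; // fixed when the database is created

struct Db {
    current: usize, // index of the current version's root page
}

impl Db {
    /// A reader pins whichever version is current when it starts,
    /// and keeps reading it for its whole lifetime.
    fn begin_read(&self) -> usize {
        self.current
    }
    /// A mutable transaction commits into the next slot, modulo the
    /// maximum number of versions, then publishes it as current.
    fn commit(&mut self) {
        self.current = (self.current + 1) % N_VERSIONS;
    }
}
```

With N slots, a writer only reuses a root page after N commits, which is what bounds how far behind a reader may lag.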

Benchmarks

I’ve benchmarked Sanakirja against three other implementations, LMDB among them.

Here is the time it takes to insert between 1 and 10 million (u64, u64) entries into each of the databases tested:

Write performance

Someone asked about performance with (UUID, u64) entries instead of (u64, u64); here it is:

Read performance

Conclusion

This is fast! I’m not entirely sure why it is so much faster than LMDB in these benchmarks, but here are a few hypotheses: first, the arity of internal nodes is higher, since Sanakirja doesn’t store the lengths on the page. Also, Sanakirja has a special implementation of leaves for types whose size is known at compile time, and experiments I ran showed a ~15% performance gain compared with a basic “unsized” implementation. I guess the compile-time optimisations enabled by genericity make up for the rest (one could probably get similar results in C++, but it would probably take longer to write and debug).
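
To make the “sized” hypothesis concrete (a sketch under my own assumptions, not Sanakirja’s actual page format): when entry sizes are known at compile time, locating the i-th entry is a single multiplication, and no lengths need to be stored on the page.

```rust
use core::mem::size_of;

/// In a "sized" leaf, entries of compile-time-known size are packed
/// back to back: the offset of entry `i` is a constant multiply.
fn entry_offset_sized<K, V>(i: usize) -> usize {
    i * (size_of::<K>() + size_of::<V>())
}

/// An "unsized" leaf must instead go through an offset table on the
/// page before reaching an entry: an extra load per lookup, and less
/// room on the page for actual entries.
fn entry_offset_unsized(offsets: &[u16], i: usize) -> usize {
    offsets[i] as usize
}
```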

I should add that even though I’ve run many careful tests, and I’ve manually inspected the trees of large databases (a feature available in the debug module of Sanakirja), these things can be a bit error-prone, as manual memory management is involved. However, I don’t expect performance to change significantly as things stabilise. One thing that could change performance a bit is checking a CRC for each page, which the old Sanakirja was doing. I believe this could be done in parallel with the main operations.

If you are interested in helping, a few things require attention: first, a large part of this code is unsafe Rust, for a number of reasons. Even though I don’t think we can completely get rid of raw pointers and lifetime extensions (we’re managing memory manually, after all…), getting array bounds checks back would be useful. More experiments with different page implementations would be interesting as well, even though the trait for pages is rather large.

Acknowledgements

I want to thank Clément Renault and Marin Postma (from Meilisearch, a fast open source search engine, written in Rust) for providing the (much needed) cheering for this redesign, and for the fruitful discussions that followed.