Sanakirja is a key-value store based on B trees, written in Rust. In this post, I reflect on some design and programming methodology mistakes I’ve made. Benchmarks show that the latest version beats LMDB, the fastest C equivalent, by 20%-50% on a use case similar to my most common load (storing graphs in Pijul).
My last post about Sanakirja sparked a few really constructive discussions, and made me realise that people still care about the problem of on-disk key-value stores, as unfancy as that problem may sound. This post looks back on some design mistakes I made when I first wrote it, and includes benchmarks showing it is now faster than the fastest equivalent C library.
Also, as some readers have noted, “sanakirja” means “dictionary” in Finnish.
A long time ago, Pijul was using LMDB as its backend, with a number of fundamental limitations, including:
Being restricted to two datatypes: an array of B+ trees where keys are byte strings and values are either (1) byte strings or (2) B+ trees where keys are byte strings and values are zero-sized. In Rust terminology, this is equivalent to roughly 500 tables, where a table is either a BTreeMap<&[u8], &[u8]> or a BTreeMap<&[u8], BTreeMap<&[u8], ()>> (a.k.a. BTreeMap<&[u8], BTreeSet<&[u8]>>).
Being written in C, meaning that it is potentially fast, but hard to extend in any nontrivial way. About the “fast” part, my benchmarks show that indeed, it is quite fast — just not as fast as a carefully-designed Rust version.
More importantly, I needed to fork tables efficiently, without copying anything. This was especially important back then, when Pijul tables for small repositories often weighed dozens of megabytes. It may be slightly less relevant now, but since the feature is there, there is no reason not to use it.
This is implemented with an extra table storing the reference count of each page that is referenced at least twice (to avoid infinite recursion, the table itself isn't clonable, and therefore all of its pages are referenced only once).
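To make this concrete, here is a minimal sketch of the idea, with assumed names rather than the real Sanakirja API: forking a tree only bumps the reference count of its root page in that table, without copying any data.
use std::collections::HashMap;

// In-memory stand-in for the on-disk RC table: pages referenced at least
// twice get an entry, a missing entry means exactly one reference.
type RcTable = HashMap<u64, u64>;

struct Db {
    root: u64, // page number of the root of the tree
}

fn fork(rc: &mut RcTable, db: &Db) -> Db {
    // The root is now referenced (at least) twice.
    *rc.entry(db.root).or_insert(1) += 1;
    Db { root: db.root }
}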
There were just too many reasons to rewrite it: performance came first, then the overly complicated concurrency model, the suboptimal algorithmic design, and the suboptimal software design. I'll explain these in reverse order (so skip ahead if you only want to see my benchmarks).
Sanakirja was the first Rust project I completed, and yet it required all the worst memory tricks I've ever needed in Rust. In hindsight, I needed that crate, so it was right to write it, but it wasn't the right way to learn Rust at all. I also knew very little about disk performance, and my goal at the time was really to write Pijul, not to profile databases. And as the benchmarks show, it wasn't terribly slow either.
That’s right, this new version works without the standard library. It’s fairly restricted in that case, since there’s no mmap. However, the allocator is generic, meaning that it could in principle be used to allocate different things, like hard drive sectors in a filesystem.
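To give an idea of what "generic" means here, this is a hypothetical sketch of such an interface; the trait names and signatures are illustrative, not the actual Sanakirja ones. The B tree code only needs to load, allocate and free fixed-size pages, so the same logic can run on top of an mmapped file, a plain memory buffer, or raw disk sectors.
// Hypothetical page-allocation interface (illustrative names only).
pub const PAGE_SIZE: u64 = 4096;

pub trait LoadPage {
    /// Return a read-only pointer to the PAGE_SIZE bytes starting at `offset`.
    fn load_page(&self, offset: u64) -> *const u8;
}

pub trait AllocPage: LoadPage {
    /// Allocate a fresh page and return its offset.
    fn alloc_page(&mut self) -> u64;
    /// Free the page at `offset` (or decrement its reference count).
    fn free_page(&mut self, offset: u64);
}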
Implementing B trees isn’t particularly hard in theory, but practical details make it non-trivial, like keys of varying sizes (these can even cause the tree to increase its depth in order to stay balanced and keep its long-term performance guarantees!). Another complication is that the main costly operation in an on-disk datastructure is accessing a sector that isn’t in RAM, either for reading or for writing, and since keys and values may be large, we need to avoid copying or moving things around. Yet another nontrivial feature is that we want operations to be transactional, and to allow readers to access the database concurrently with a writer (in which case readers see the database as it was before the current writer started).
One issue is that, in a first implementation, it isn’t obvious how to separate what belongs to the “B tree level” (which handles the high-level operations and does the “page management”, meaning allocating and freeing pages) from the underlying “node level” implementation (which deals with the byte representation of data). Before you know it, the two levels are mixed and make assumptions about one another. This makes it hard to see what matters performance-wise, and what doesn’t. After looking at my code again recently, I’ve tried to formalise here the lessons I’ve learned since my first steps in Rust back in 2015:
Trust the Rust compiler! At a stage of learning where you spend most of your time fighting the borrow checker (it was worse then; we have non-lexical lifetimes now), it is sometimes hard to see past that stage and imagine how the compiler will optimise your code, or what the LLVM passes will do. Also, on-disk B trees must define their own data representation, since data stored with one version of the Rust compiler must be readable with another version. Still, one thing I noticed while rewriting was that I was often reasoning about which branches the compiler would cut, which loops it could unroll, and so on.
I guess what I mean here is: sometimes you don’t need complex genericity everywhere to reach your performance goals. Checking a constant parameter at runtime, or thinking carefully about which variables are mutable and which aren’t, can actually turn into compile-time optimisations, as in the small sketch below.
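Here is a made-up example of that (not code from the crate): the check on a constant known at monomorphisation time looks like a runtime branch, but each instantiation compiles down to a single arm.
pub trait PageKind {
    // Known at compile time for each implementing type.
    const IS_LEAF: bool;
}

pub struct Leaf;
pub struct Internal;

impl PageKind for Leaf {
    const IS_LEAF: bool = true;
}
impl PageKind for Internal {
    const IS_LEAF: bool = false;
}

pub fn entry_size<P: PageKind>(key_len: usize) -> usize {
    // Looks like a runtime check, but `P::IS_LEAF` is a constant, so the
    // compiler keeps only one of the two arms in each monomorphised copy.
    if P::IS_LEAF {
        key_len
    } else {
        key_len + 8 // leave room for a child pointer
    }
}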
I usually try to get things working as quickly as possible, then profile and improve (the “make it work, then make it fast” idea). However, this particular crate made me question that approach in some cases, because it tends to mix the different levels and blur the boundaries between them. One thing I’ve learned is that spending time building, benchmarking and profiling the basic blocks may seem to delay the overall plan; in this case, however, exploring different designs for the basic parts made it clear which features of those blocks could vary and which were necessary, yielding a more modular design in the end, which was key to the performance improvements I got.
Part of the art of writing mathematical proofs, especially in a larger paper, is finding the “min-cuts” of the graph of arguments. In the case of proofs, this means finding the smallest set of hypotheses you need for a lemma, which can also be proven independently in other lemmas. This isn’t easy, because a smaller conclusion may sometimes mean you’ve lost some hypotheses you had during the proof, and you won’t be able to use them anymore. The same happens in code, when you refactor common operations and try to use nice, readable abstractions: while that sounds like a good thing to do, you may lose hypotheses on what is behind the abstractions.
Concretely, in Sanakirja, we can’t always write to pages in order to modify them, for example if they’re pointed to by another tree, or if they are visible to concurrent readers. This isn’t very complicated in the case of insertions, because the worst thing that can happen is a page split. The 169 lines of Rust for the new version of put reflect that.
However, deletions are harder, since page merges might cascade: if two pages A and B are to be merged, the key between them in their parent P must be deleted from P, and that deletion might in turn cause P to merge with its neighbour Q. Moreover, if P is read-only and Q isn’t, doing the operation lazily means that we can merge P’s items directly into Q, rather than first copying P to delete a key, and only then merging.
When I first wrote Sanakirja, I remember being pretty proud of the following trick to achieve this: I refactored common code so as to use the same function for merging, rebalancing and splitting pages. I would first create an iterator, and then “spread” it over one page (when merging) or two pages (when splitting or rebalancing). While this may seem like an elegant refactoring of the code shared by the three main operations of a B tree, it is also “lossy” in terms of the hypotheses it can make about the items of the iterator.
More precisely, here is an edited version of what I was doing before (iter_all returns an iterator, a struct with private fields):
// Build an iterator for the left page
let it_a = iter_all(&page_a).map(|(ptr, kv, right)| PageItem { ptr, kv, right });
// Build an iterator for the right page
let mut it_b = iter_all(&page_b).map(|(ptr, kv, right)| PageItem { ptr, kv, right });
// Skip the left child of the right page (it will be replaced).
let b0 = it_b.next().unwrap();
let new_page = {
    let it = it_a
        .chain(std::iter::once(PageItem {
            ptr: middle_ptr,
            kv: Some((middle_key, middle_value)),
            right: b0.right,
        }))
        .chain(it_b);
    // If one of the elements needs to be replaced (because we're
    // deleting in an internal node), adapt the iterator.
    let it = WithChanges::new(del, it, replacement);
    // Finally, make a single page with all the items.
    self.spread_on_one_page(it)?
};
Compared to the actual old version, I’ve essentially removed all the instructions related to memory management: should we increase the reference count of the child pages? Of the keys and values? Since we’ve lost all hypotheses about the origin and destination of the elements, this is incredibly hard to get right, and the WithChanges iterator, along with the spread_on_one_page method, needs quite a few lines to try and recover these hypotheses.
My new version for the same thing is quite different: it describes the operation a bit more verbosely, by means of a struct to represent the modifications that need to be applied to a page:
pub struct ModifiedPage<
    'a,
    K: Representable,
    V: Representable,
    P: BTreePage<K, V>,
> {
    pub page: CowPage,
    // Whether the page can be written to (used for RC).
    pub mutable: bool,
    // Start copying c0 (keep `page`'s left child).
    pub c0: P::Cursor,
    // If > 0, replace the right child of c0's last element with `l`.
    pub l: u64,
    // If > 0, replace the left child of c1's last element with `r`.
    pub r: u64,
    // Possibly insert a new binding (replacement for a deletion).
    pub ins: Option<(&'a K, &'a V)>,
    // And another one (in case a child has split).
    pub ins2: Option<(&'a K, &'a V)>,
    // The first element of c1 is to be deleted, the others must be copied.
    pub c1: P::Cursor,
    // Whether to skip `c1`'s first element.
    pub skip_first: bool,
}
And then another struct to represent the concatenation of a modified page and a “plain” one:
pub struct Concat<'a, K: Representable, V: Representable, P: BTreePage<K, V>> {
    // Key between `modified` and `other`.
    pub mid: (&'a K, &'a V),
    // Modified page.
    pub modified: ModifiedPage<'a, K, V, P>,
    // The other page in the concatenation.
    pub other: CowPage,
    // Is the modified field on the left or on the right of the
    // concatenation?
    pub mod_is_left: bool,
}
These structures serve two purposes: first, they make it easy to evaluate which decision must be taken (does the concatenation fit on one page, or do we need to rebalance the left and right pages instead?). Second, they make it easy to reason about updates to the reference counts, since we know precisely where each element of the page comes from, and where it goes.
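The decision itself then boils down to a size check. Here is a minimal sketch of it, where the constants and the way sizes are computed are assumptions for illustration, not the actual code:
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 32; // assumed page header size

enum Decision {
    // Everything fits on a single page: merge.
    Merge,
    // It doesn't fit: rebalance entries between the two pages.
    Rebalance,
}

// `modified_size` and `other_size` would be computed by walking the cursors
// of the `ModifiedPage` and the entries of the plain page, respectively.
fn decide(modified_size: usize, mid_size: usize, other_size: usize) -> Decision {
    if HEADER_SIZE + modified_size + mid_size + other_size <= PAGE_SIZE {
        Decision::Merge
    } else {
        Decision::Rebalance
    }
}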
One thing I want to note here is that my previous design doesn’t strike me as an obvious form of premature abstraction: there are no module hierarchies, verbose keywords or (God forbid) UML diagrams, just a simple iterator used as a shortcut. My second design is less abstract, more verbose and pedantic, yet it is very easy to reason about: the function that updates the reference counts (called modify_rc in the code for del) is not particularly short, but it is really straightforward compared to the guesswork of the WithChanges iterator in the previous version.
In an attempt to improve read and write performance, the nodes of my B tree were using a mini-skiplist. While that wasn’t fundamentally slow, at the scale of a single page it was still slower than naive solutions like binary search over a sorted array, which is itself slower than linear search over a (potentially unsorted) array. I remember implementing skiplists because I like the idea, and then thinking of them as a modular brick I could replace later to improve performance as needed. However, I failed to implement them in a modular way (or rather, I initially succeeded, but then debugging happened), and they ended up contaminating the rest of my code, making them hard to replace.
In the new version, I’m still sorting my internal nodes because I want to traverse them in order, but they’re just basic sorted arrays. I don’t have benchmarks about that particular issue, but sorting the arrays allows the search to stop as soon as one element larger than the searched key is found, whereas unsorted arrays force the algorithm to traverse each node entirely.
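Here is a minimal sketch of that early exit, on a plain slice rather than an actual page:
// Entries are sorted by key, so the scan can stop at the first key greater
// than the one we are looking for.
fn lookup(entries: &[(u64, u64)], key: u64) -> Option<u64> {
    for &(k, v) in entries {
        if k == key {
            return Some(v);
        } else if k > key {
            // The key cannot appear further right in a sorted array.
            return None;
        }
    }
    None
}

fn main() {
    let page = [(1, 10), (3, 30), (7, 70), (9, 90)];
    assert_eq!(lookup(&page, 7), Some(70));
    assert_eq!(lookup(&page, 4), None); // stops as soon as it sees key 7
}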
First, I wasn’t happy with the concurrency model, especially cross-process and cross-platform. The hack I described in my previous post turned out not to be as reliable as I wished (as has often been the case over the last three months, @tankf33der’s careful tests spotted some problems). That hack was designed to work around a flaw in my original design, where all transactions read the same “zero” page to get a reference to the current version of the database, and that page needed to be protected. This is fixed now: instead, I’m using many pages (each with its own filesystem lock). There is a “current” version, and each mutable transaction increments the current version (modulo the maximum number of versions).
This means that Sanakirja no longer has to wait, before starting the next mutable transaction, for all the readers that started before the previous commit to finish. Instead, readers may lag behind the current version by at most a constant number of commits, set when the database is created.
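The arithmetic behind this is simple; here is a rough sketch under assumed names (this is not the actual Sanakirja code, and the real synchronisation uses filesystem locks on the version pages):
// Number of versions kept around, set when the database is created
// (assumed constant name, for illustration).
const N_VERSIONS: u64 = 8;

// The writer advances the current version modulo N_VERSIONS.
fn next_version(current: u64) -> u64 {
    (current + 1) % N_VERSIONS
}

// A writer only has to wait if the slot it is about to reuse is still
// pinned by a reader, i.e. if that reader lags by N_VERSIONS commits.
fn writer_must_wait(current: u64, oldest_reader: u64) -> bool {
    next_version(current) == oldest_reader
}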
I’ve benchmarked Sanakirja against three other implementations:
BTreeMaps only store Sized types, whereas Sanakirja also stores dynamically-sized types, a major source of complications in the code.

Here is the time it takes to insert between 1 and 10 million (u64, u64) entries into each of the databases tested:
Someone asked about performance with (UUID, u64) instead of (u64, u64); here it is:
This is fast! I’m not entirely sure why it is so much faster than LMDB in these benchmarks, but here are a few hypotheses: first, the arity of internal nodes is higher, as Sanakirja doesn’t store the lengths on the page. Also, Sanakirja has a special implementation of leaves for types whose size is known at compile-time, and experiments I ran showed a ~15% performance gain compared with a basic “unsized” implementation. I guess compile-time optimisations enabled by genericity make up for the rest (one could probably get similar results in C++, but it would probably take longer to write and debug).
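To make the first hypothesis concrete, here is a back-of-the-envelope comparison; the header and bookkeeping sizes are made up for illustration:
const PAGE_SIZE: usize = 4096;
const HEADER: usize = 16; // assumed page header size

fn main() {
    // Fixed-size layout: an 8-byte key and an 8-byte value, nothing else.
    let fixed = (PAGE_SIZE - HEADER) / 16;
    // "Unsized" layout: the same entry plus, say, 6 bytes of offsets and
    // lengths per entry.
    let with_lengths = (PAGE_SIZE - HEADER) / (16 + 6);
    println!("fixed-size entries per page: {}", fixed); // 255
    println!("entries with lengths per page: {}", with_lengths); // 185
}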
I should add that even though I’ve run many careful tests, and manually inspected the trees of large databases (a feature available in the debug module of Sanakirja), these things can be a bit error-prone, as manual memory management is involved. However, I don’t expect performance to change significantly as things stabilise. One thing that could change performance a bit is checking the CRCs of each page, which the old Sanakirja was doing; I believe this could be done in parallel with the main operations.
If you are interested in helping, a few things need attention. First, a large part of this is unsafe Rust, for a number of reasons. Even though I don’t think we can completely get rid of raw pointers and lifetime extensions (we’re managing memory manually, after all…), getting array bounds checks back would be useful. More experiments with different page implementations would be interesting as well, even though the trait for pages is rather large.
I want to thank Clément Renault and Marin Postma (from Meilisearch, a fast open source search engine, written in Rust) for providing the (much needed) cheering for this redesign, and for the fruitful discussions that followed.