darcs get http://pijul.org/sanakirja
Sanakirja just got enough features to replace LMDB in Pijul. It still lacks a few rebalancing operations, but apparently, there’s no leak, and it passes ```cargo test''', which has just 5 tests for now. As this is supposed to become Pijul’s standard format for repositories, I want to start documenting it, using this blog as a first step while the format is probably not 100% stable yet.
Sanakirja 0.2 implements B trees (not B+ nor B* trees, just B trees). Apparently, plain B trees increase the number of cache misses, compared to B+ trees, when traversing the database. The reasoning is the following: when traversing consecutive leaves, B trees force you to load some of your parents in cache to read at most as many values as the depth of the B tree, which is normally quite shallow. B+ trees don’t have that limitation, since they store all their data in leaves.
Moreover, in the case where each key is bound to at most one value, B+ trees don’t store values in the nodes, which means nodes can store more keys, and have bigger arity, reducing the tree depth. Actually, I’d like to see the actual calculations for that: for the same keys and values, there are as many nodes (including leaves) in a B tree as there are leaves in a B+ tree. I’m not sure the increase in the number of leaves does not compensate the gain in arity.
Moreover, in a B+ tree, all internal entries have to be stored twice, which, if all pages are half-full with b/2 entries (the worst case), means the extra storage size required is a fraction 2/b of the total size (probably very modest for small enough keys, i.e. large values of b).
In Pijul, things are a bit messier: first, our keys are bound to several values in the most performance-critical table, so we already lose one of the main benefits of B+ trees.
But then, there’s something else: since we want persistence (storing several versions of the database that share all but their differences), it doesn’t seem very easy to use B+ trees. Indeed, the bottom level is a list, and there’s no way that I know of to get persistence in a list without copying the whole list.
I’m not 100% certain how LMDB does it. They seem to use persistence as a way to provide crash resistance. However, at some point, sibling pages need to be updated, but they’re on a different page. What happens if the machine crashes between that update and the final commit?
One of the goals of Sanakirja and Pijul is to allow other implementations, or scripting, or both. This is why I want to explain how B trees are implemented in Sanakirja.
First, we need an efficient way to store a sorted list structure in the nodes of the B tree. In Sanakirja, after trying other structures, I finally opted for skip lists, one of the structures suggested by Nicolas Schabanel (thanks to him!), for at least two reasons:
They are really easy to implement: since they must fit on one memory page, we can use a constant “depth” (number of lists), which avoids all heap allocations, and makes all functions on a single page tail-recursive.
We keep copying nodes anyway, which offers great opportunities to derandomize.
I initially thought that the performance could not be guaranteed against an attacker who can insert keys in a prescribed order, but this is actually wrong: if they insert too many keys, this will trigger a copy, and hence a deterministic rebalancing.
Values larger than half the size of a memory page are tricky: apparently, LMDB implements this in very fancy way, by storing unused pages of the file, sorted in a B tree, and when a value n pages in size is stored, it finds the best spot to allocate it, by looking at the B tree of free pages.
This also seems to indicate that allocating a large value takes a time linear in the number of free pages.
In order to simplify the implementation, and make our database as small as possible on disk, I cowardly decided to go for a slightly less conveniient API, which is anyway more than enough for Pijul: values smaller than one quarter of a page are stored on a page. Other values are allocated on a linked list of pages.
This makes our page allocator very simple, as all blocks are the same size. The drawback is that the API for getting a value will sometimes receive an iterator on byte strings, instead of a single byte string (more precisely, it will always receive an iterator, and the iterator will iterate more than once on large values).