Thoughts on branches

Monday, February 15, 2016
By Pierre-Étienne Meunier

Since our release of Pijul 0.2, I’ve been working mostly on the biggest feature of Pijul 0.3: branches. In this post, I try to explain the current plan.

Do we need branches?

First, branches in Pijul are not like branches in Git. Actually, Git branches are like Pijul’s patches, so in order to produce the simplest possible system, I was once tempted to find simpler ways to map Git workflows to patches. Sometimes, however, we do need some form of branches. There are two use cases I can think of:

Abstract concept (what the user sees)

Branches are of course mostly an engineer’s thing, but they’re an absolutely essential feature to have, and a clean definition would make Pijul really easy to learn.

I’ve been thinking of two different definitions/semantics:

  1. Branches as sets of patches.
  2. Branches as sets of patches and other branches.

The first definition has the merit of simplicity, both in implementation and in learning curve, whereas the other would for instance allow an experimental branch to follow a stable branch, even if patches do not pass through the experimental branch first.

Implementation and complexity

These ideas definitely call for a purely functional implementation: with the same implementation as in Pijul 0.2, we could get branches almost for free. Two branches would share everything but their differences. Of course, branches that forked long ago and are “mostly merged” would not share much, but this is probably unavoidable, or at least would require whole new ideas.

I’ve contacted the authors of LMDB (the fantastic backend library we’re currentyl using), in order to know whether it would be possible to add some limited form of referential transparency to their B+ trees. They kindly replied that they would probably not do it, which I definitely understand: why would someone want to add such an overhead in space and maintenance for a feature probably used by a single user?

So, I’ve decide to start a new LMDB. It will certainly not be as fast as the original one, but it will have this missing feature. This would give the following time and space complexities:

What we have now

After a few days of work, I have a full implementation of transactions on arbitrarily large files. My implementation is thread-safe, process-safe, and crash-resistant. It allows any number of concurrent readers and one writer at a time. Moreover, the writer does not exclude readers (readers still read the version before the writer started).

The only operation excluding readers is when a writer commits its transaction, which is done by a call to msync on a single page. I’m now starting to fill these pages with a useful data structure, like B+ trees.

I’ve tried to pay close attention to compatibility with 32-bit machines, even for large files. The only non-portable things right now are (1) calls to mmap and msync, which need to be translated to mmap64, and to windows when this part of the library becomes stable and (2) machines with page size other than 4kB are not yet supported. The only such architecture I’m aware of is SPARC (which also happens to be big-endian, but I already handle this).

Since I don’t want the nodes of my B-trees to be more than an x86-page, this probably means that nodes of more than 4kB (i.e. very large keys and/or values) will involve some amount of copying on SPARC.

What is left to do