Towards 1.0

Saturday, November 7, 2020

After fixing the performance and scalability problems, we’re on our way to getting a stable Pijul. In this post, I explain what I’ve been up to in the recent months.

Context

Pijul has always been advertised as a research project, trying to implement a theory of patches that would be sound and fast. This is an ambitious goal, and became even more ambitious than initially envisioned.

One of the hardest challenges is that source code is by essence stateful, which makes it much harder to iterate over algorithm designs, like normal research projecst need to. For example, in order to get from our last published version to our current design, we have gone through many different variants, and there wasn’t much to publish.

Moreover, the UX aspect is what matters most in the end, and testing it on a real world project is the only way to get it there. However, unlike in a compiler, where bootstrapping is done one step at a time, and previous versions are always available to compile your current one, a version control system has the additional problem that the previous versions might not always be easily accessible if there is a bug.

One of the criticisms I’ve heard since I realised that better datastructures were possible is that I was “working secretely”. I certainly understand this feeling, but this is based on a misunderstanding of how research works. When I first had the idea that I’m explaining in this post, I realised that a complete rewrite would be needed. But for a very long time, almost nothing other than unusable, unreadable prototypes happened.

Back then, there wasn’t much to show, since it wasn’t even clear that the basic datastructure would work. And even when they started working at a large enough scale, it took me quite a bit of testing on large repositories before they started actually working.

This also implies that there wasn’t much to show for quite a while, since the new algorithm wasn’t usable until very recently, and any repository started before now would have become obsolete in a matter of days.

There were also persoprofessional reasons for this silence, described at the end of this post.

Pijul depends on two other projects I’ve started.

Sanakirja

One of these projects is Sanakirja, which is “just” a key-value store, but has the extra feature that databases can be cloned efficiently. I would have loved to just use an existing library, but there just isn’t any that has this cloning feature. However, the scope of Sanakirja is still quite modest, it does one thing and does it well. Obviously, it took some time to find the memory-management bugs, but I have good confidence that this is now done.

In previous releases of Pijul, databases were implemented with a single mmapped file containing the binary representation of B Trees. Despite their lower writing performance (compared to alternatives such as Log-structured merge-trees), and the complexity of the code for deletions, B Trees are very well suited to this use case: indeed, since they are trees, reference-counting the nodes is enough to implement efficient clones.

One of the remaining issues was that in order to grow the database, we needed to un-mmapped the file, grow it, and mmap it again. Since applying a single change in Pijul must be an atomic operation, we needed to cancel the transaction when that happened, and restart it with a bigger file.

Another issue is that I wanted the next libpijul to compile on platforms that don’t have mmap, such as WASM. However, if reallocating an mmapped file has a very low complexity (even though it does have a non-zero cost in terms of system calls), reallocating a chunk of memory often requires copying everything. This completely defeats the point of the algorithms in Pijul, which rely on a particular representation of the datastructures on the disk.

The main innovation in Sanakirja 0.13 is to use a vector of memory blocks (either in memory or mmapped from a file), of exponentially-increasing size. The overhead is just one extra indirection, the complexity of adding items is the same (since the operation of creating an extra block is $O(1)$). The exponentially-increasing sizes mean that the allocated memory is always at least half-full.

Thrussh

The other one is Thrussh. That library implements the SSH protocol, and tries to handle a number of key formats. The former is a surprisingly easy goal, and keeping up with Tokio versions has historically been the hardest bit, while the latter is the most horrendous hydra-like task, with new heads and legacy formats showing up every time you think you’re done.

How repositories used to work (and still do, to some extent)

Old-style repositories represented a single file by a directed graph $G = (V, E)$ of lines, where each vertex $v\in V$ represented a line, and an edge from $u \in V$ to $v\in V$, labelled by some change (also called patch) number $c$, could be read as “according to change $c$, line $u$ comes before $v$”.

This means that changes could introduce vertices and lines, as in the following example, where a line $D$ is introduced between $A$ and $B$:

Here, the thick line represents the change from the file containing the lines $A$, $B$, $C$ to the file with the new line $D$. An important feature to note is that vertices are uniquely identified, by the hash of the change that introduced them, along with a position in that change. This means that two lines with the same content, introduced by different changes, will be different. It also means that a lines keeps its identity, even if the change is applied in a totally different context.

Moreover, this system is append-only, in the sense that deletions are handled by a more sophisticated labelling of the edges. In the example above, if we want to delete line $D$, we just need to make a change mapping the edge introduced by $c_0$ to a deleted edge, which we label by the name $c_1$ of the change that introduces it:

From now on, we call the full edges alive, and the dashed ones dead.

We have just described the two basic kinds of actions in Pijul. There are no other. One kind adds vertices to the graph, along with “alive” edges around them, and the other kind maps an existing edge label onto a different one. In order to fully described the system, I also need to mention that the edge labels are given by two parameters: their status (alive, deleted, and a few others related to multiple files and technical details explained below) and the change that introduced them.

Dependencies

This scheme allows to defines dependencies between changes:

Of course, this is just the minimal set of dependencies needed to make sense of the text edits. Hooks and scripts may add extra language-dependent dependencies based on semantics.

Are edge labels minimal?

Our goals is to find the smallest possible system, both for reasons of mathematical aesthetics (why store useless stuff?) and the other one for performance. Therefore, one immediate question comes to mind: why even keep the change number on the edges?

In order to answer that question, suppose we don’t keep the labels, meaning that the maps happen between statuses only. Then, consider the following two situations:

Conflicts and CRDTs

According to what we described so far, there are two types of conflicts in Pijul:

Moreover, it is easy to show that Pijul implements a conflict-free replicated datatype (CRDT): indeed, we’re just adding vertices and edges to a graph, or mapping edge labels which we know exist because of dependencies.

However, Pijul’s datastructure models, in a conflict-free way, the conflicts that can happen over a text file. In fact, Pijul would remain a CRDT with or without the design choices explained above about edge labels: for example, we could decide that the “deleted” status has higher precedence. But as shown above, that wouldn’t model conflicts accurately.

Pseudo-edges

The scenario I want to talk about now is the sequential (i.e. single-user) situation where we start with many lines. Our user deletes all of them, adds one new line at the very beginning, and one at the very end, as shown in the following diagram:

If we represented this situation naively like in that diagram, the complexity of applying changes and outputting the repository would depend linearly on the size of the graph, as we would need to traverse the entire thing to know about line $C$, and know it comes after $B$.

The trick is to use what we call “pseudo-edges”, which are not part of any change, but are just here to keep the “alive subgraph” (the subgraph of the alive vertices) connected. Every time we delete an edge, we add pseudo-edges between the source of the edge and all the descendants of the target, like the dotted edge in the graph below:

Multiple files

The extension of this scheme to multiple files is done in the following way:

Disappointing performances, and a new hope

In the previous Pijul, the contents of each line-vertex was stored in a table in the database. This was not optimal, since:

  1. they weren’t stored compressed,
  2. the content of lines was stored twice: once in the change file, and another time in the database,
  3. some memory pages used to store long (>512 bytes) lines could be underfull. For example, a line of length 513 required 4096 bytes of storage.

As wasteful as this may seem, on the repositories we analysed, this was only about half of the problem. The other half of the disk space was taken by the graph. Not only were repositories taking a lot of space on the disk, but this was also making non-trivial use cases extremely hard to debug, as the graph would quickly become impossible to plot, even on a codebase as small as Pijul itself.

Moreover, some potential applications of Pijul called for a word-based diff. But the size of these graphs was already out of control for lines, and words would only make it worse, because that would require one vertex per word.

This was a major scalability issue, even bigger than parsing your particular format of SSH key (which is possibly the most frequently reported problem).

I did consider abandoning the project at that point, but just before that I wanted to try two ideas out:

In the end, I didn’t “fix” Sanakirja, but managed to solve both problems at the same time. If we could store groups of lines, we might as well store groups of bytes, the graph would be larger. And as an extra benefit, if graph vertices store references to the byte offsets of a change, we don’t have to store a table mapping line numbers to contents, we can simply load contents from the change files directly.

However, this idea meant that a completely new Pijul had to be written, in order just to benchmark the idea. I wrote the core algorithms during my Christmas holidays at the end of 2019, and the results were quite satisfactory, as I was able to import a large number of large files without a problem. Even better, I was able to graphviz the result.

I wasn’t super confident about how to announce that publicly, since a number of people had started using Pijul, and even though they knew the project was still experimental, this move would be a breaking change in all possible imaginable ways. Moreover, it wasn’t clear to me that I was still legally allowed to work on Pijul (more on this below).

Main changes in the graph structure

We first revisit the first two diagrams in this post: adding and deleting lines. Vertices are now referenced by a change number (for example $c_0$) and a byte interval in that change (for example $[0, n[$, which means “bytes from offset $0$ included to offset $n$ excluded”). Note that vertices can now represent an arbitrary number of lines. Moreover, the size they occupy in memory is independent from $n$ (assuming $n < 2^{64}$).

Starting from a single vertex $c_0:[0, n[$ with $n$ bytes (containing an arbitrary number of lines), introduced by change $c_0$, adding a line is done by first splitting $c_0:[0, n[$ at some offset $i < n$, and inserting the new vertex just like before.

This means in particular that the splitting of content into lines is done by the diff algorithm and is encoded in the changes, instead of being set in stone for all repositories. With a different diff algorithm, we could imagine splitting contents according to programming language structure.

Once we know how to split vertices, deletions are handled almost as before: as shown in the following diagram, where we first apply the same change $c_1$ as in the previous example, and go on to applying change $c_2$, which deletes the end of vertex $c_0:[0, i[$ from some $j<i$ to $i$, and the beginning of vertex $c_1:[0, m[$, from $0$ to some $k<m$.

One important difference with before is that our previous edges had two different roles which were not clearly distinguishable from one another until now. One of these meanings was to order the lines, and the other one was the status. However, now that vertices can be split, the “status” role of edges becomes ambiguous: for example, a deleted edge pointing to vertex some vertex $c:[i, j[$ means that bytes $[i, j[$ of change $c$ are deleted, but what if we split that vertex into $c:[i, k[$ and $c: [k, j[$? Should we add an extra deleted edge, and if so, where?

There is a simple solution: by introducing a new kind of edge label (named BLOCK in the source code) we can distinguish between “internal” or “implicit” edges that are only here to order the blocks, and “status” edges informing about the status of their target vertex. I’ll probably explain more about this in a future blog post, or in the manual, or in a paper.

New change format

Certainly the most breaking change of them all is the new patch format:

New change representation

Changes now have a text representation, aiming at a one-to-one conversion with binary changes. This is a new thing, and is not totally trivial to get right, in particular because one cycle of serialization and deserialization must yield the exact same binary content.

When creating a new change, authors are presented with a draft of the change in the text format (unless some recorded files are detected to be binary files), in which they can (1) edit the headers, (2) delete the changes they don’t want, and (3) add dependencies.

One can also try to edit the changes, but the result is not guaranteed to work. It is, however, one way to test the examples presented in this post, which is especially useful when paired with pijul debug and graphviz.

Version identifiers

Another new idea is that we now have version identifiers. This has always been a difficulty in Pijul, because any two patches $c_0$ and $c_1$ that could be produced independently commute, meaning that applying $c_1$ after $c_0$ yields the same repository as applying $c_0$ after $c_1$.

This means that meaningful version identifiers must be independent from the order. This could be achieved easily with any linear function of the hashes, for example taking the bitwise XOR of hashes. However, some naive schemes would have the problem that servers could trick clients into accepting that versions are synchronised, whereas they are not. The bitwise XOR described above has this problem, and so do other linear functions: since the hashes are random, there is a high probability that any $n$ hashes form an independent linear basis of the vector space of hashes. The problem of forging a version identity becomes easy, as it is just a matter of solving a linear equation.

Our new scheme for version identifiers comes from the discrete log problem. The version identifier of an empty repository is always the identity element $1$, and applying a change with hash $h$ on top of version $V = e^{v}$ is $V^h = e^{v\cdot h}$.

Now, because getting from $V$ to the exponent $v$ is hard (at least for a classical computer), forging a version identity is hard, since we would need to generate hashes whose multiplication is $v$, without even knowing what $v$ is.

I don’t think this scheme is robust to quantum algorithms, though: if you can find $v$ easily, it seems you can solve linear equations on a vector space based on the multiplicative group over $\mathbb N$, and $\mathbb Z/2\mathbb Z$ as the scalars.

New protocol

The last thing I want to talk about today is the new protocol. First, Pijul can work without a central server, and our central hosting service does not need to be open source for that (thanks for asking, and even more thanks for reading my countless previous answers before asking).

One particular thing worth mentioning is how comparisons between remote and local versions is done. In previous version, we would download text files representing a list of changes over SSH or HTTP, and compare the whole sets.

The new way is with a binary search over the network, where the client first asks the server what their last version identifier is, and then performs a binary search to determine the latest version the client knows about. Then, new changes are fully or partially downloaded (depending on their size). Finally, we go through the new vertices that are still alive after applying all the new changes, and download (if necessary) the full changes that introduced these vertices.

A new name?

One common criticism we’ve heard since we started Pijul a few years ago was about the name. I came up with that name, but to be honest, I was more interested in getting stuff to work (which was challenging enough) than in thinking about names at that time.

One suggestion I’ve commonly heard is that maybe we should translate the name to another language. The translation of that word in English is Ani, but the relevant domain names are not available, and the googlability is terrible. Then, Anu is the translation in portuguese, and also a word in many other languages, and is even the name of an antique God in Mesopotamia, which is actually the first result to show up on Wikipedia, along with a nice logo in cuneiform which looks like a messed up commutative diagram.

Anyway, it seems this new name has offended some people. I should have asked more people about it, but in times of lockdown I don’t have many around me. After running a Twitter poll, I’m now convinced that neither name is terrible, and the previous name has the advantage of being almost uniquely googleable, so I’m reverting that change.

What about the Nest?

The old Nest will remain available until the end of 2020 at https://old-nest.pijul.com. The HTTPS certificates are broken at the moment, but SSH will continue to work.

A personal note

All the above has been complicated by the fact that I accepted an academic job in August 2018, only to be told, after I had resigned from my previous job at Inria and moved to another country, that I wouldn’t be allowed to work on my own research projects (including Pijul), or at least that the legal status of that work outside of my holidays was unclear.

For some context, that is not how academia has worked for the last 2500 years, at least not how it should work, and not how it has worked for me for the last 10 years. There is nobody in particular to blame for that, the system is just broken.

It wasn’t optimal, but at least there wasn’t any pandemic around… until there was. I ended up leaving academia (at least temporarily) in July 2020, after securing funding for another exciting project, also involving cool science, but with more freedom.