Tuesday, April 23, 2019

By Pierre-Étienne Meunier

We released Pijul 0.12 yesterday. This is a really exciting time for the project, as we have reasons to believe that all the algorithms are now in a pretty good shape.

A lot of things changed in this release, but I’ll focus on four:

A good diff algorithm is essential to version control. Diff is an algorithm that produces a *patch*. Often, we’re interested in a smallest patch.
Note that smallest patches are not necessarily unique: while this is fine for producing patches, the lack of unicity might cause problems when using the same algorithm to merge changes (which is what diff3 does).

The diff algorithm used in Pijul is a little tricky to implement, because of conflicts: the main question is, how should we record the fact that a user has removed conflict markers? How to detect that a conflict has been solved? What if a user edits one side of the conflict, while leaving the conflict open?

For that reason, Pijul was using a naive diff algorithm until (and including) version 0.11. That algorithm uses a very simple variant of dynamic programming, and runs in O(*n* *m*) time and memory, where *n* and *m* are the sizes of the files we are comparing.

After an important refactoring, the diff algorithm was split from the main code, into a separate crate called *diffs*, which allowed me to implement other diff algorithms, such as Myers diff, which runs in O((len(*n*)+len(*m*))* *d*), where *d* is the size of the patch produced, and patience diff, which is not really a diff algorithm but rather a way to run an existing diff algorithm on a file.

Both Myers diff and Myers-based patience diff are now usable in Pijul.

I didn’t really know Myers diff in detail before considering it for Pijul, but if I had to start teaching algorithms again some day, this would be a pretty good example of how to go from an obvious dynamic programming algorithm to a linear time algorithm, applied to a real world example where it actually matters a lot. In particular, the way Myers diff computes the cost of the optimal solution is relatively straightforward, but finding the solution itself using only linear memory is quite elegant. If you want to learn more about it, I found this explanation to be pretty good.

Outputting a graph into a file (which we sometimes call “flattening” a graph) has always been a painpoint in Pijul, for three reasons:

- defining what the user wants to see when there is a conflict is not easy. In particular, some conflicting situations are tricky, and we want to output them in a way that is useful to the user, and is consistent.
- the patch application algorithm is not deterministic: it sometimes yields equivalent, but different, graphs. Moreover, the graph sometimes gets simplified after we output it. However, the outputting algorithm
*must*be 100% deterministic, since it is used in diff: if the flattening algorithm could yield different outputs on equivalent graphs, we could sometimes have non-empty diffs just after a clone! - these algorithms are not part of the “cool new theory” behind Pijul, so it was hard to give them as much attention as the rest when “the rest” was still under construction.

The algorithm in Pijul 0.11 included:

- Tarjan’s strongly connected component algorithm, to produce a directed acyclic graph (DAG) of strongly connected components (yes, some patches can yield cyclic files in Pijul!).
- A depth-first search (DFS) to find forward edges, simplify them away, and detect conflicts (using cross edges).
- A stack-based algorithm to output conflicts based on the cross-edges found in the DFS.

This worked relatively well, until some bizarre conflicts started to appear due to a bug in Pijul, and this flattening method used to output them in a way that made them hard or impossible to solve. Indeed, the diff algorithm must have a good enough output of a conflict to be able to record its resolution.

The new outputting algorithm uses just Tarjan’s algorithm, and a DFS that reduces conflicts as it goes. First, forward edges are fairly easy to detect and skip. Then, when we are done exploring all distinct children of a node, we look at where all the resulting branches end. First, when we reach the last line before a conflict, the first path we explore from there will necessarily end at a lower position in the graph than the other paths explored from that same line. Moreover, more than one nested conflict might start from that line.

Pijul’s new strategy is to use a datatype called `PathElement`

that can contain either a single `SCC`

, or a `Conflict`

. When we are done exploring all the sides of a conflict, we group them by their last point, and move them to the side of the conflict that contains their end point, potentially converting a number of `SCC`

on that side into `Conflict`

. When we are done with this, only one path should remain (if it is ever the case that multiple sides remain, then we create a new conflict for them. I don’t know whether or how this could happen).

Another important novelty in Pijul 0.12 is that we started using PGP to sign patches. I want to thank the Sequoia PGP team for making this possible. I know from first-hand experience that implementing PGP is hard, and they are doing great work. This means that the commands for `pijul key gen`

and `pijul key upload`

have been slightly altered. Existing PGP keys can be used, but maybe not in their full generality (for instance, smart cards and agents are not yet supported).

An important bug was found in Sanakirja, and has been fixed. I took the opportunity to implement Sanakirja’s full concurrency model, and updated my previous blog post today to explain more about this.