Theory

The theory of Pijul is actually quite simple, even though it requires great care in the implementation, in order to get the correct algorithmic complexity.

In Pijul, files are generalised to arbitrary graphs of lines, with different kinds of labels on the edges. In this representation, an edge from line a to line b means that a comes before b in the file. The global data structure in a repository is append-only, in the sense that patches can never delete lines: line deletions from a file are implemented as changes in the labels of the edges pointing to that line.

The hidden Pijul subcommand info (pijul info --debug) dumps the entire graph into a file in graphviz' dot format, one for each branch of the repository, called debug_branch (usually debug_master by default).

Motivation of this representation

This idea is motivated by a notion of category theory called a pushout.

A pushout can be seen as a merge between two operations. Originally, many of the ideas behind the theory of Pijul come from a paper by Mimram and Di Giusto showing how to generalise flat files to make them have all pushouts. In other words, they show that a close relative of the representation of files in Pijul is the smallest generalisation of standard files that handles all the conflicts, modulo some caveats (that paper doesn't handle line deletions or multiple files).

If you want to learn more about category theory, a good starting point is to try and understand the Yoneda lemma, which roughly shows how to relate any category to operations related to sets. This is important in particular because sets are a well-understood category. A few concepts are required to understand that lemma, mostly categories, the important example of the "Set" category, functors and natural transformations between them, and finally presheaves.

Outputting a generalised file

One of the important algorithms in Pijul is the one that outputs generalised files into actual on-disk files.

Conflicting and non-conflicting files

If the two following conditions are both met:

  • the graph is acyclic, and contains a path passing through all the vertices.
  • for each vertex v, all incoming edges of v have the same label ("alive" vs. "deleted").

Then the task of presenting the file to the user is relatively straightforward: just go along the path and output all alive vertices. Else, we say the file has a conflict, and presenting the state of the file to the user in an intelligible way is not obvious.

The major benefit of this approach, compared to the valid-state-only approach of Git, or to the patches-only approach of Darcs, is that conflicts are just like any normal state. If we remove one side of the conflict (for instance using unrecord, the conflict disappears. If we create a patch that solves a conflict in one context, the same patch will solve the same conflict in any other context. This implies in particular that conflicts do not reappear.

Zombies

When Alice deletes a block of text, and Bob writes inside that same block in parallel, the merge results in a "deleted context" when Alice tries to apply Bob's patch. On Bob's side, things are a little more complicated, because the situation looks no different than if Alice had been aware of Bob's line when she deleted her lines.

In this case, Pijul detects the conflict on both sides, and reconstructs a minimal context of one line around Bob's line, and produces completely equivalent repositories on each side (which is one of the fundamental promises of patch commutation). Because these context lines will now have both alive and deleted incoming edges, they are called "zombies" in Pijul's source code.

Solving conflicts

Solving "zombie conflicts" is relatively straightforward, as we just have to produce a patch setting all the labels of incoming edges to a vertex to the same value.

However, in order to solve ordering conflicts, we need to add new edges. For complicated reasons related to algorithmic complexity, Pijul chooses to break such an edge in two halves, adding a new vertex in the middle. This also coincidentally yields a simpler patch format, since patches that just add vertices and change edge labels are enough.

Cycles

Sometimes, solving a conflict involves ordering lines that were previously unordered. If Alice and Bob have a conflict between two lines A and B, and solve the conflict with opposite orders (i.e. Alice writes AB and Bob BA), this union of their edits will be cyclic, since Alice adds an edge from A to B, and Bob adds an edge from B to A.

Producing a patch

Producing a patch is done by comparing an actual file on disk with the output of a generalised file. This is an algorithmically hard problem (actually NP-complete) when the number of patches involved in the conflict increases. One major source of hardness is that the user might want to change the order between sides of a conflict, another one is that different sides might have a line with the same contents, and we might be forced to choose just one of the sides.

Pijul solves this by using an Approximation algorithm, with the effect of producing minimal patches only up to a constant factor when solving a conflict, instead of the absolute minimum.

In addition to that difficulty, the are a number of other challenges when implementing the diff algorithm for conflicts: how do we detect that a conflict has been solved? that one side has been deleted, but not the others? that a side has been edited?

Applying and unapplying patches

Compared to the previous tasks, this is relatively easy: applying a patch is almost like computing the union of two sets, and patches contain enough information to make that operation reversible.

However, in order to get the correct algorithmic complexity, there are a few things to take care of. In particular, when a patch deletes lines, Pijul introduces extra edges to "jump over" deleted parts. Thanks to this trick, the complexity of the outputting algorithm (and therefore also of diff) does not depend on the size of the repository's history.

When unapplying a patch that deleted lines, we have to make sure that these extra edges are deleted, and we might sometimes need to add new ones to connect the newly undeleted lines to other alive parts of the graph, for else they might be skipped by the outputting algorithm.