Commutation and scalability

Sunday, December 20, 2020

By Pierre-Étienne Meunier

I just finished the implementation of an important feature of Pijul: clones, pushes and pulls on partial repositories. In this post, I explain why this matters.

Change commutation

Pijul is based on changes, also called patches or diffs. This doesn’t mean that its only internal datastructure is patches, quite to the contrary: it was only by departing from a patch-only internal representation that we were able to solve the algorithmic challenges inherent to patch-based systems.

However, being change-based does mean that the core operations of Pijul are defined on changes, and that Pijul is designed in such a way that changes satisfy basic intuitive properties, similar to algebraic operations. One basic thing is that applying a change is an associative operation, like matrix multiplication: applying $A$ and $B$ at once, and then later $C$, is the same as applying $A$, and then $B$ and $C$ at once. In matrix multiplication, $(AB)C = A(BC)$. Moreover, in Pijul, all changes are invertible (whereas only some matrices are): for any change $A$, there is an “inverse change” $A^{-1}$ such that applying $A^{-1}$ after $A$ is the same as applying neither. Of course, both $A$ and $A^{-1}$ will appear in the log, but the contents of the repository will be the same as applying neither $A$ nor $A^{-1}$.

There is another property that users want from version control systems, and that is commutation. Matrix multiplication rarely commutes¹. In Git, commutation is usually simulated using branches and rebase: indeed, rebasing a branch A on top of another branch B really means commuting the commits of A since the divergence between A and B, and the commits of B since the divergence. However, that commutation isn’t perfect, since the commits must change their hash when rebased.

Things are simpler in Pijul, because any two changes that could have been written independently always commute, meaning that if the two changes could be written without knowledge of each other, they can be applied in any order.

This of course raises a potential concern:

If I can apply $A$ and $B$ in any order, how do I know which order is the right one?

In Pijul, this question does not matter: both orders will yield the exact same result, the only difference is that the log will list the changes in the order in which they were applied. And I’m not saying that it doesn’t matter because I’m careless, but because it truly is the same thing.

I disagree: it does matter, I still prefer to have a “linear” order for my changes/commits.

Indeed, everybody wants to see the order of operations in a repository, for many reasons. For example:

We want to keep a record of the operations performed on our repository.
We want to go back in time.

And in fact, Pijul allows you to do exactly that, but in a more rigorous way than Git. Indeed, take the scenario where Alice and Bob work together, Alice makes a change $A$ while Bob makes $B$. When they put their work together, Alice applies Bob’s change, resulting in the log $AB$, while Bob applies Alice’s change, resulting in the log $BA$. In this case, there is no “true” linear history, since they worked on different things, and took different steps at different times. However, both of them want to be able to go back in time, step-by-step, and not just “step-by-step-according-to-Bob’s-order”.

Commutation and massive repositories

One of the biggest challenge for Pijul up to the recent releases was scalability: even modestly-sized repositories like Pijul’s source code would use a lot of disk space. This was even more disappointing since, as I’m about to explain, commutation was supposed to allow it to scale to gigantic repository sizes… in theory. The same problem also made massive tests impossible, meaning that getting past the “0.x releases” seemed more and more impossible as time passed.

Now that this phase is mostly behind us, the cool bits of the theory finally become practical.

In particular, in Pijul, each change contains a reference to the files it modifies. Note that, because we want operations on repositories (such as renaming files) to commute with edits inside files, we don’t identify files and directories by name, but by a unique identifier made from the hash of the change that introduced that file or directory.

For repositories with multiple projects, this makes it possible to clone and pull just parts of a repository, and work on that part as if we had the entire thing. Indeed, imagine we have a repository with the following log:

A, adding file x
B, editing x
C, adding file y
D, adding file z
E, renaming x to w
F, editing y and z

All these changes do not necessarily commute; however, since B and C touch completely different files (namely, x and y), they could be produced in parallel, and hence they commute. This means that we can pull only the changes related to a specific file, say x, and make the following history: A, B, E².

Moreover, any change made on top of that sequence will commute with C, D and F: indeed, if we edit file x again, producing a change G, then since G can be made in parallel to C, D and F, we can push G after any patch that comes after B in that history, for example getting history “ABCDEFG”.

Note that this is done without changing the changes nor their hash.

Another trick for large files

As explained in previous posts on this blog, Pijul changes have a bit more information than diffs, and operate on graphs rather than files. This means that changes can be split into two parts, a short-ish one with a binary specification of the graph operations, and then the new content inserted by the change.

The change format is designed to be downloadable in two stages: one can download the operations without downloading the contents. One issue with this is security: if we don’t download the contents, how can we make sure that the hash is right? This is done by including a hash of the change in the “operations” section of the change, and letting the hash of a change be the hash of the “operations” section.

This makes it possible for someone to make five versions of a large binary file in a day, where each change deletes the entire file, and adds it again, the operation sections only contain the length of the different versions, not the actual bytes. In order to get the latest version of the file, a client will therefore only have to download the latest change completely, and only the operations section of the previous ones.

Note that this makes the following “attack” possible: a server might trick a client into believing that the server has a change with hash $A$, which inserts $n$ bytes into a file, and then another change with hash $B$, deleting all these bytes. When the client downloads $A$ and $B$, it doesn’t need to download the contents. However, if the client later decides to unrecord $B$, the contents of $A$ will have to be downloaded, and the client will be able to tell that the hash of $A$ was incorrect. This should make it impossible to unrecord $B$ without also unrecording $A$. However, if $A$ made other edits, and other changes depend on $A$, this could be problematic.

What is next?

There is one remaining painpoint for very large repositories, and this is the fact that in order to clone a repository with a very large history, one must download and apply all the changes, one by one. Even though our apply function is quite fast³, this can still be problematic for dozens or hundreds of thousands of changes.

In my next post, I will talk about a solution to this problem, which I have started to implement.

Matrices that are simultaneously diagonalisable do, for example, but for two arbitrary matrices $A$ and $B$, $AB$ is often different from $BA$. ↩︎
Note that in this case, we could also pull A, E, B, since renaming a file commutes with editing it. However, we must start with A, since adding a file could not be possibly done in parallel to editing or renaming it. ↩︎
The complexity of apply is in $O(|p| |c| \log |H|)$, where $|p|$ is the size of the change, $|c|$ is the size of the largest conflict in which $p$ is involved, and $|H|$ is the number of edits made since the start of the repository. ↩︎