Two changes to changes

Monday, June 28, 2021
By Pierre-Étienne Meunier

In this post, I announce the final changes to the repository and change formats before we declare Pijul “1.0 beta”.

A few updates first

A number of things have happened since the OVH fire, and I haven’t had time to comment on them until now.

Two changes to the change format

First, I want to emphasise the fact that these two changes were made, for the first time, in a totally backwards-compatible way, which is in itself a major event in this project, because it means we can now commit to backwards-compatibility.

Non-UTF8 encodings

@Rohan had been working for a while on a really hard project, which was to adapt Pijul’s diff algorithm to various file encodings. This is now finally merged into Pijul!

Pijul is capable of computing diffs on different entities, such as lines, words, or any other split of a file. So far, only lines are implemented, but changing that should be extremely simple (a very good first contribution, if you’re interested).

However, the characters used to represent these splits may change depending on the file encoding, and hence the diff algorithm must take that into account, which is what Rohan’s patches do. An additional complication is that Pijul presents patches as a file to the user, and deciding on a unique encoding for that file, when a project may contain multiple different encodings, is not easy.
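To make the encoding issue concrete, here is a minimal sketch of line splitting that depends on the encoding. The names Encoding and split_lines are hypothetical, invented for this illustration rather than taken from Pijul’s code, and only two encodings are covered.

```rust
/// Encodings known to this sketch (hypothetical, not Pijul's actual type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Utf8,
    Utf16Le,
}

// The byte pattern of '\n' in each encoding.
const LF_UTF8: &[u8] = &[0x0A];
const LF_UTF16LE: &[u8] = &[0x0A, 0x00];

/// Split `bytes` into lines, each line keeping its terminator,
/// looking for the newline as it is encoded in `enc`.
fn split_lines(bytes: &[u8], enc: Encoding) -> Vec<&[u8]> {
    let newline = match enc {
        Encoding::Utf8 => LF_UTF8,
        Encoding::Utf16Le => LF_UTF16LE,
    };
    // One code unit is as wide as the encoded newline in both cases here.
    let unit = newline.len();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    // Advance one code unit at a time, so that a match can never
    // straddle two UTF-16 code units.
    while i + unit <= bytes.len() {
        if &bytes[i..i + unit] == newline {
            lines.push(&bytes[start..i + unit]);
            start = i + unit;
        }
        i += unit;
    }
    // Last line without a terminator (or trailing incomplete code unit).
    if start < bytes.len() {
        lines.push(&bytes[start..]);
    }
    lines
}
```

For instance, the UTF-16LE bytes of “Ċ” (U+010A) are 0x0A 0x01. A splitter looking for the single byte 0x0A would cut that character in half, whereas split_lines with Encoding::Utf16Le leaves it intact.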

The solution Rohan found was to add information about the encoding along with permissions, next to the file name, and transcode to UTF-8 (which is also the encoding of Rust’s strings, making things easier to debug) at the time of presenting the patch file to the user.
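As a rough illustration of that idea, the sketch below keeps an optional encoding next to a file’s name and permissions, and transcodes to UTF-8 only when the contents are shown to the user. FileMeta, Encoding and to_display are hypothetical names, and the actual layout of this metadata in Pijul’s change format is not shown here.

```rust
/// Encodings known to this sketch (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Encoding {
    Utf8,
    Utf16Le,
}

/// Hypothetical per-file metadata: the encoding sits next to the
/// file name and permissions.
struct FileMeta {
    name: String,
    permissions: u16,
    /// `None` means the contents are treated as opaque bytes.
    encoding: Option<Encoding>,
}

/// Transcode stored bytes to UTF-8 for display only. The stored bytes
/// are left untouched. Returns `None` if no encoding is recorded or
/// if the bytes are not valid in the recorded encoding.
fn to_display(meta: &FileMeta, bytes: &[u8]) -> Option<String> {
    match meta.encoding? {
        Encoding::Utf8 => std::str::from_utf8(bytes).ok().map(String::from),
        Encoding::Utf16Le => {
            // UTF-16 data must be a whole number of 2-byte code units.
            if bytes.len() % 2 != 0 {
                return None;
            }
            let units: Vec<u16> = bytes
                .chunks_exact(2)
                .map(|c| u16::from_le_bytes([c[0], c[1]]))
                .collect();
            String::from_utf16(&units).ok()
        }
    }
}
```

In this sketch, transcoding happens only at display time, so what is stored remains a plain block of bytes.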

One thing this doesn’t handle is collaborators working on the same file using different encodings: not only did we decide that this situation was highly improbable, but this also goes against one core principle of Pijul’s design, which is to represent blocks of bytes only, without trying to give them any interpretation.

Since this required storing extra information about the encoding, we needed a slight change in the format.

Malleable identifiers and patch signatures

The other major format change is related to online identities, and is also backwards-compatible. The precise details of author names weren’t initially the main focus of the project (building an algebra of patches was), but constructive bikeshedding happened on that topic, and we’ve finally merged a new scheme for identities.

From now on, when you make your first record ever using Pijul, we will ask you to generate a cryptographic key to sign your patches (using pijul key generate). Secret keys can already be encrypted, and SSH agents will be supported in the future, without changing the key format.

The “author” field in patches now gives you the choice between a simple free-format string and your public key, and a mapping between public keys and identities is stored in repositories in .pijul/identities. The default is to use a public key, and to ask you to generate one if you don’t have one.
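The following sketch shows why this makes identifiers malleable: when a patch records a key instead of a name, the name shown can change without touching the patch. The Author type, the display_name function and the in-memory map are assumptions made for this illustration. The real format of the author field and of the files in .pijul/identities may differ.

```rust
use std::collections::HashMap;

/// Hypothetical model of the "author" field of a patch.
#[derive(Debug, Clone, PartialEq)]
enum Author {
    /// A simple free-format string.
    Name(String),
    /// A public key, in some textual form.
    Key(String),
}

/// Resolve an author to a displayable name, using a mapping from
/// public keys to identities (here simply a display name).
fn display_name(author: &Author, identities: &HashMap<String, String>) -> String {
    match author {
        Author::Name(name) => name.clone(),
        Author::Key(key) => match identities.get(key) {
            Some(name) => name.clone(),
            // Unknown key: fall back to showing the key itself.
            None => key.clone(),
        },
    }
}
```

Updating the entry for a key in the map changes what display_name returns for every patch carrying that key, while the patches themselves stay byte-for-byte identical.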

Even though the patch format is now general enough to avoid requiring extra changes (we can use the various “extensions” fields to extend it), the format of the mapping could still change a little bit.

Are we bug-free yet?

I believe we’ve reached a stage where we can guarantee that no data can be lost by the tool. Patches have clear semantics, which are highly unlikely to change in the near future. As for nest.pijul.com, I am also starting to be confident that no more than a few minutes of data can be lost.

One thing that has surprised me in this project is that many people seem to think that it ought to be bug-free at all times, even though it has always been advertised as “not quite ready”.

A glimpse of the next steps

We’re entering a serious feature-freeze phase, where the only issues now considered for beta will be bugfixes, at least for a while.

The only big thing coming up before the beta is going to be customisable diff algorithms, which will make it easy to work with large binary files, where Pijul patches will be no more costly than synchronising files using rsync. This can be done without any change to the external interface or the formats.