In this post, I announce the final changes to the repository and change formats before we declare Pijul “1.0 beta”.
A number of things have happened since the OVH fire, which I haven’t had time to comment on until now:
Thanks to @tankf33der’s tireless testing, we’ve caught three bugs in Sanakirja that could cause data corruption on very large instances. If you are using Sanakirja, make sure you’re using version 1.2.7 (or later) of the `sanakirja-core` crate.
We now have an efficient implementation of binary diffs, based on a rolling checksum (as in rsync), with my `diffs` crate run on the result. This isn’t yet integrated into Pijul, but will be soon.
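To make the idea concrete, here is a minimal sketch of an rsync-style rolling checksum (Adler-like, computed modulo 65521): the window slides one byte at a time in O(1), which is what makes matching blocks across large binary files cheap. This is only an illustration, not the implementation used by Pijul.

```rust
// A toy rsync-style rolling checksum. Not Pijul's actual code.
const MOD: u32 = 65521;

struct Rolling {
    a: u32,      // sum of the window's bytes, mod MOD
    b: u32,      // sum of prefix sums, mod MOD
    window: usize,
}

impl Rolling {
    /// Compute the checksum of an initial window from scratch.
    fn new(window: &[u8]) -> Self {
        let (mut a, mut b) = (0u32, 0u32);
        for &x in window {
            a = (a + x as u32) % MOD;
            b = (b + a) % MOD;
        }
        Rolling { a, b, window: window.len() }
    }

    /// Slide the window one byte: drop `out`, append `inp`, in O(1).
    fn roll(&mut self, out: u8, inp: u8) {
        let (m, n) = (MOD as u64, self.window as u64);
        let a = (self.a as u64 + m - out as u64 + inp as u64) % m;
        // b_new = b - n*out + a_new (kept non-negative before reducing).
        let b = (self.b as u64 + n * m + a - n * out as u64) % m;
        self.a = a as u32;
        self.b = b as u32;
    }

    fn digest(&self) -> u32 {
        (self.b << 16) | self.a
    }
}

fn main() {
    let data = b"the quick brown fox jumps over the lazy dog";
    let w = 8;
    let mut r = Rolling::new(&data[..w]);
    for i in 0..data.len() - w {
        r.roll(data[i], data[i + w]);
        // Rolling must agree with recomputing from scratch.
        assert_eq!(r.digest(), Rolling::new(&data[i + 1..i + 1 + w]).digest());
    }
    println!("ok");
}
```

The point of the O(1) `roll` is that a receiver can scan a whole file for blocks the sender already has without recomputing a checksum per offset, which is what keeps binary-diff costs close to rsync’s.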
The performance of a number of commands (including `pijul record` and `pijul unrecord`) has improved a lot in the last few months, thanks to benchmarking and various algorithmic tricks. I’ll publish benchmarks in the next few weeks.
Commands that output a repository (including `pijul reset` and `pijul channel switch`) can now write files in parallel. `pijul record` also supports running in parallel, but this is turned off for the moment, until we find a way to make the resulting patches fully deterministic.
Tags finally got a proper implementation: they are available through the `pijul tag` subcommand, and provide a way to navigate very large histories, which until now was only possible via very costly chains of unrecords. The implementation relies on a trick in the design of Sanakirja 1.0 that allows compressed databases to be read very efficiently. I’ll probably blog about that bit of our technology at some point.
OpenSSL is now optional in the latest versions of Thrussh (meaning that key algorithms depending on OpenSSL won’t work without the feature), and @darleybarreto has offered to help with better integration with the Rust Crypto project. If that works, and we can bring *ring* back in (which is hard at the moment because of the legacy key formats needed to parse SSH keys), then we’ll be very close to a pure Rust (+ASM) crypto backend, which will hopefully be easier to install on all platforms.
First, I want to emphasise that these two changes were made, for the first time, in a totally backwards-compatible way. This is in itself a major event for the project, because it means we can now commit to backwards compatibility.
@Rohan had been working for a while on a really hard project: adapting Pijul’s diff algorithm to various file encodings. This is now finally merged into Pijul!
Pijul can compute diffs over different units, such as lines, words, or any other way of splitting a file. So far, only lines are implemented, but changing that should be extremely simple (a very good first contribution, if you’re interested).
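As a sketch of what pluggable splitting could look like (the trait and names here are hypothetical, not Pijul’s actual API), the diff algorithm only needs some way to cut a file into comparable units:

```rust
// Hypothetical splitter interface: the diff algorithm compares abstract
// chunks, and a Splitter decides what a chunk is. Not Pijul's real API.
trait Splitter {
    fn split<'a>(&self, input: &'a str) -> Vec<&'a str>;
}

/// Line-based splitting (keeps the terminators, so chunks concatenate
/// back into the original file).
struct Lines;
impl Splitter for Lines {
    fn split<'a>(&self, input: &'a str) -> Vec<&'a str> {
        input.split_inclusive('\n').collect()
    }
}

/// Word-based splitting.
struct Words;
impl Splitter for Words {
    fn split<'a>(&self, input: &'a str) -> Vec<&'a str> {
        input.split_whitespace().collect()
    }
}

fn main() {
    let text = "hello world\nsecond line\n";
    assert_eq!(Lines.split(text), vec!["hello world\n", "second line\n"]);
    assert_eq!(Words.split(text), vec!["hello", "world", "second", "line"]);
    println!("ok");
}
```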
However, the characters used to represent these splits may change depending on the file encoding, and hence the diff algorithm must take that into account, which is what Rohan’s patches do. An additional complication is that Pijul presents patches as a file to the user, and deciding on a unique encoding for that file, when a project may contain multiple different encodings, is not easy.
The solution Rohan found was to store information about the encoding along with the permissions, next to the file name, and to transcode to UTF-8 (which is also Rust’s native string encoding, making things easier to debug) when presenting the patch file to the user.
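To illustrate the transcoding step, here is a minimal sketch for the Latin-1 case, where every byte maps directly to the Unicode code point with the same value. Real transcoding covers many more encodings (typically through a dedicated crate); this is only the simplest instance of the idea.

```rust
// Decode Latin-1 bytes into a UTF-8 Rust String. Each Latin-1 byte
// is exactly the Unicode scalar value with the same number, so a
// per-byte conversion is a correct decoder for this one encoding.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // "café" in Latin-1: 'é' is the single byte 0xE9.
    let latin1 = [0x63, 0x61, 0x66, 0xE9];
    let s = latin1_to_utf8(&latin1);
    assert_eq!(s, "café");
    // In UTF-8, 'é' takes two bytes, so the String is 5 bytes long.
    assert_eq!(s.len(), 5);
    println!("ok");
}
```

Presenting every file through one encoding like this is what lets a single patch file mix hunks from files with different on-disk encodings.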
One thing this doesn’t handle is collaborators working on the same file using different encodings: not only did we decide that this situation was highly improbable, but handling it would also go against a core principle of Pijul’s design, which is to represent blocks of bytes only, without trying to give them any interpretation.
Since this requires storing extra information about the encoding, the format needed a slight change.
The other major format change is related to online identities, and is also backwards-compatible. The precise details of author names weren’t initially the main focus of the project (building an algebra of patches was), but constructive bikeshedding happened on that topic, and we’ve finally merged a new scheme for identities.
From now on, when you make your first record ever using Pijul, you will be asked to generate a cryptographic key to sign your patches (using `pijul key generate`). Secret keys can already be encrypted, and SSH agents will be supported in the future, without changing the key format.
The “author” field in patches now gives you the choice between a simple free-format string and your public key, and a mapping between public keys and identities is stored in repositories, in `.pijul/identities`. The default is to use a public key, and to ask you to generate one if you don’t have one yet.
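The key-to-identity indirection can be sketched as follows (the types, key format, and names here are purely illustrative, not Pijul’s actual schema): a patch carries only the public key, and the human-readable name is resolved at presentation time from the repository-local mapping.

```rust
// Toy model of the key -> identity lookup. The schema is hypothetical.
use std::collections::HashMap;

/// Hypothetical identity record; Pijul's actual format may differ.
struct Identity {
    name: String,
    email: String,
}

fn main() {
    // The repository-local mapping (in reality loaded from disk).
    let mut identities: HashMap<&str, Identity> = HashMap::new();
    identities.insert(
        "ed25519:5f4dcc3b5aa765d61d8327deb882cf99",
        Identity { name: "Alice".into(), email: "alice@example.org".into() },
    );

    // A patch's author field: just the key. The display name is
    // resolved at presentation time; if the key is unknown, we fall
    // back to showing the key itself.
    let author_key = "ed25519:5f4dcc3b5aa765d61d8327deb882cf99";
    let display = identities
        .get(author_key)
        .map(|id| format!("{} <{}>", id.name, id.email))
        .unwrap_or_else(|| author_key.to_string());
    assert_eq!(display, "Alice <alice@example.org>");
    println!("ok");
}
```

One consequence of this indirection is that updating a display name only touches the mapping, never the signed patches themselves.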
Even though the patch format is now general enough to avoid requiring extra changes (the various “extensions” fields can be used to extend it), the format of the mapping could still change a little.
I believe we’ve reached a stage where we can guarantee that no data will be lost by the tool. Patches have clear semantics, which are highly unlikely to change in the near future. As for nest.pijul.com, I am also starting to be confident that no more than a few minutes of data can be lost.
One thing that has surprised me in this project is that many people seem to expect it to be bug-free, even though it has always been advertised as “not quite ready”:
First, for many people (including myself), version 0.x usually means “in development” or “experimental”, and so does “alpha” in a version number. I usually don’t expect projects with these labels to be anything other than new, possibly ambitious, and certainly buggy. If you’re the kind of person who thinks such stages of a project are “comical”, I totally respect that feeling, but I recommend not using software with these labels for purposes other than a good laugh. In the case of Pijul, you might want to wait until a firm 1.0 (not alpha, nor beta, nor 0.x).
Second, the part of the stack we needed to write, and now need to maintain, is a bit thicker than I’m used to:
One major source of bugs has definitely been our storage engine, Sanakirja, which is now quite stable, and is starting to be used in other projects as well. Despite the relatively rough road towards building Sanakirja, I do consider its current stability an achievement of its own, independently of the broader Pijul context. For a refresher, I wrote about Sanakirja in a blog post earlier this year.
The reason Sanakirja was so hard to get right is that it has a complicated memory management algorithm, with a large number of cases. In short, Sanakirja is an on-disk, transactional key-value store with O(log n) clones. As such, it needs to allocate and free pages on disk, while keeping track of reference counters, in such a way that a panic in the program, or “pulling the plug”, doesn’t corrupt the storage, but simply drops the current transaction.
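As a toy model of the kind of bookkeeping involved (vastly simplified compared to Sanakirja’s real allocator, and purely illustrative), consider reference-counted copy-on-write pages: cloning a database just bumps counters, writing to a shared page copies it first, and a page is reclaimed only when its count drops to zero.

```rust
// Toy reference-counted copy-on-write page store. Illustrative only.
use std::collections::HashMap;

struct Store {
    pages: HashMap<u64, (Vec<u8>, usize)>, // id -> (data, refcount)
    next: u64,
}

impl Store {
    fn new() -> Self {
        Store { pages: HashMap::new(), next: 0 }
    }

    fn alloc(&mut self, data: Vec<u8>) -> u64 {
        let id = self.next;
        self.next += 1;
        self.pages.insert(id, (data, 1));
        id
    }

    /// An O(1) "clone": just bump the reference count.
    fn clone_page(&mut self, id: u64) -> u64 {
        self.pages.get_mut(&id).unwrap().1 += 1;
        id
    }

    /// Copy-on-write: a page with several owners is copied before
    /// mutation, so other owners keep seeing the old contents.
    fn write(&mut self, id: u64, byte: u8) -> u64 {
        let (data, rc) = self.pages.get(&id).unwrap().clone();
        if rc == 1 {
            self.pages.get_mut(&id).unwrap().0.push(byte);
            id
        } else {
            self.pages.get_mut(&id).unwrap().1 -= 1;
            let mut copy = data;
            copy.push(byte);
            self.alloc(copy)
        }
    }

    /// A page is reclaimed only when its last owner frees it.
    fn free(&mut self, id: u64) {
        let rc = {
            let e = self.pages.get_mut(&id).unwrap();
            e.1 -= 1;
            e.1
        };
        if rc == 0 {
            self.pages.remove(&id);
        }
    }
}

fn main() {
    let mut s = Store::new();
    let a = s.alloc(vec![1, 2]);
    let b = s.clone_page(a);  // O(1) clone: a and b share the page
    let b2 = s.write(b, 3);   // write forces a private copy
    assert_ne!(a, b2);
    assert_eq!(s.pages[&a].0, vec![1, 2]);      // old owner unchanged
    assert_eq!(s.pages[&b2].0, vec![1, 2, 3]);  // new owner sees the write
    s.free(a);
    assert!(!s.pages.contains_key(&a));
    println!("ok");
}
```

The hard part in a real store is doing this on disk, transactionally: every case in this bookkeeping must leave the previous committed state intact if the process dies mid-operation, which is where the large number of cases comes from.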
It would have been hard to gather all the test cases and interactions with potential users without writing a hosting platform like nest.pijul.com, and we desperately needed real-world test cases in addition to the many automated checks. Many of those test cases revealed ambiguities in how Pijul worked, and led to constructive discussions. On the other hand, this also meant writing an SSH library with which one can write servers. That library is now used in a number of other projects for various purposes (not all of them related to SSH), which is nice, but we also had to follow the many iterations of the async/Tokio ecosystem since 2016 (Thrussh predates Tokio, and initially used the lower-level Mio), which involved heavy refactoring with each change in that ecosystem.
Fires in datacenters don’t help much in getting a project going. After I rebuilt the Nest following the OVH fire in March, I worked on replicating repositories on multiple servers in different places. This wasn’t easy, as there is no off-the-shelf tool to replicate Pijul repositories efficiently. That situation led to a number of small synchronisation glitches over the last few weeks: to give just one example, some patches wouldn’t replicate to a server that had been considered “under attack” by the cloud provider in the few minutes before the replication. This wasn’t easy to debug (at least not without targeting a DDoS at my own server), but I believe these difficulties are behind us now.
Third, this project is designed to work on all platforms, but the diversity of encodings and platform specificities is massive, and not all platforms implement their own documentation (for example, WSL still doesn’t implement its own documentation for mmap, even though mmap is arguably the most fundamental Unix system call).
Finally, contrary to what seems to be a common belief, there is no professional, permanent “team” behind this project, even though I did dedicate a few months full-time to it at the end of 2020, while between jobs.
We’re now entering a serious feature freeze, where the only changes considered before the beta will be bugfixes, at least for a while.
The only big thing coming up before the beta is customisable diff algorithms, which will make it easy to work with large binary files, where exchanging Pijul patches will be no more costly than synchronising files with rsync. This can be done without any change to the external interfaces or formats.