Announcing Pijul 1.0 beta

Tuesday, January 18, 2022
By Pierre-Étienne Meunier

I’m proud to finally announce the beta release of Pijul, after a bit more than a year of alpha. Sorry for the long post, and Happy New Year!

53 versions of Libpijul 1.0.0-alpha

Pijul has come a long way since the initial alpha release, in terms of performance, stability and features. Here are the most notable achievements since the 1.0.0-alpha release in November 2020:

Stability guarantees

Releasing a beta version obviously doesn’t mean that Pijul is completely bug-free. However, there are a number of claims we can finally make, after a year of public alpha:

Why was Pijul hard to stabilise?

First reason: the underlying storage layer was hard to write. Writing any efficient storage format is by definition hard. Examples include filesystems, which must avoid corruption and protect against hardware errors. Databases are even worse, since they must additionally implement all sorts of fast queries and efficient storage.

There are mainly two reasons why the basic layers of these systems were/are hard to write:

  1. Debugging is hard: corruption usually only happens when the scale of data becomes large enough to make bugs hard (often meaning “long”) to reproduce.
  2. “Partially correct” versions of most apps might still be usable in spite of occasional bugs. Often, restarting the application or reloading some page does the trick. In the case of storage systems, everything is stateful, and no amount of “reloading the page” can solve a corruption issue.

Second reason: the upper-layer algorithms need an absolutely perfect storage layer, yet are also essential to test that layer. This is a chicken-and-egg problem, since corruption in the base layer can yield arbitrary, unpredictable results anywhere, yet the base layer cannot be tested without writing and debugging these algorithms.

Third reason: alignment between maths and UX. While the mathematics at the foundation of Pijul have the potential to deliver a simpler, more intuitive UX, users rarely want to know about the reasons for different design choices, since these reasons can sometimes be quite technical. A number of iterations have been spent on aligning the design with the feedback we’ve received.

Performance guarantees

Until quite recently, performance was not our main focus: features and stability were. That said, performance was never ignored: the main goal in designing the formats was to make them optimally fast and small, even if the implementation wouldn’t follow immediately.

Many benchmarks were designed and suggested by @tankf33der; we show some of them below. The timings shown here were recorded on an Intel Xeon E3 v5, a 4-physical-core CPU (8 with hyperthreading) running between 2.80GHz and 3.70GHz in a 2016 laptop. RAM was never an issue. Pijul was compiled with ZStd 1.4.9, since ZStd 1.5 has significant performance regressions.

Benchmark methodology: importing Git repositories

The pijul git command goes through several stages, and is essentially a breadth-first search of the commit DAG; a rough code sketch of the whole loop is shown after the list.

  1. Load the entire history graph into main memory to find all initial commits (many smaller repos have only one initial commit).

  2. Create a channel for each initial commit. Initialise a todo list with one “todo item” for each of these commit/channel pairs.

  3. Then, repeat the following until the todo list is empty:

    1. Pop the most urgent commit/channel pair from the todo list, calling the commit c and the channel l.

    2. Check that all of c’s parents have been imported:

      • If so, apply them all to channel l and move to the next step.
      • If not, put (c, l) back at the end of the todo list and continue the loop (i.e. move back to step 1 of the loop).
    3. If c has more than one parent, prune the channels where these parents were imported, since they are no longer in the todo list but are still alive in the repository.

    4. git reset to c, pijul channel switch to the channel if needed, pijul add all the files and run pijul record.

    5. If c has a single child d, place (d, l) as the most urgent item on the list.

    6. Else, c has n > 1 children. Fork channel l n-1 times, placing each of the children, paired with one of these channels (l or one of its forks), at the top of the todo list.
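To make the traversal more concrete, here is a rough, self-contained Rust sketch of the loop described above, using a double-ended queue as the todo list. Everything in it (the Dag type, record_commit, fork, the channel-naming scheme) is a hypothetical stand-in rather than the actual pijul git code, which drives git and Libpijul directly; applying the patches of the other merge parents and pruning their channels are only hinted at in comments.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-ins: the real `pijul git` works on libgit2 objects
// and drives Libpijul directly.
type CommitId = String;
type Channel = String;

/// The commit DAG: every commit knows its parents and its children.
struct Dag {
    parents: HashMap<CommitId, Vec<CommitId>>,
    children: HashMap<CommitId, Vec<CommitId>>,
}

fn import(dag: &Dag, initial_commits: Vec<CommitId>) {
    let mut imported: HashSet<CommitId> = HashSet::new();
    let mut fresh = 0usize;
    let mut new_channel = move || {
        fresh += 1;
        format!("channel-{}", fresh)
    };

    // Step 2: one channel per initial commit, seeding the todo list.
    let mut todo: VecDeque<(CommitId, Channel)> = initial_commits
        .into_iter()
        .map(|c| (c, new_channel()))
        .collect();

    // Step 3: loop until the todo list is empty.
    while let Some((c, channel)) = todo.pop_front() {
        let parents = dag.parents.get(&c).cloned().unwrap_or_default();
        if !parents.iter().all(|p| imported.contains(p)) {
            // Some parent of `c` hasn't been imported yet: put (c, channel)
            // back at the end of the list and try another item.
            todo.push_back((c, channel));
            continue;
        }
        // In the real command, the patches recorded for c's other parents
        // are applied to `channel` here (the "apply" timings below), and the
        // channels of merge parents are pruned; both are omitted here.

        // Check out `c`, switch channels if needed, add files and record.
        // The real command runs git and pijul, roughly:
        //   git reset --hard <c>; pijul channel switch <channel>
        //   pijul add <all files>; pijul record
        record_commit(&c, &channel);
        imported.insert(c.clone());

        // Steps 5 and 6: the first child keeps `channel`, every other child
        // gets a fork of it; all of them become the most urgent items.
        let children = dag.children.get(&c).cloned().unwrap_or_default();
        for (i, child) in children.into_iter().enumerate().rev() {
            let next = if i == 0 {
                channel.clone()
            } else {
                fork(&channel, new_channel())
            };
            todo.push_front((child, next));
        }
    }
}

fn record_commit(c: &CommitId, channel: &Channel) {
    println!("recording {c} on {channel}");
}

fn fork(from: &Channel, to: Channel) -> Channel {
    println!("forking {from} into {to}");
    to
}
```

The deque captures the two priorities used in the description: pairs whose parents are missing go back to the end of the list, while the children of a freshly recorded commit become the most urgent items at the front.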

Importing Vim

Vim is a text editor whose history is of a relatively modest size for an open source project: 15138 commits to import in order to get to the HEAD of the main branch. Note that Pijul records all commits on all forks that were later merged onto main as independent patches, and doesn’t record trivial merges (merges without conflicts or any modification after the merge itself).

The following two graphs show the time taken by pijul record and pijul apply in the algorithm for pijul git described above, for the 15138 commits in the Vim repo. Only one commit isn’t shown: commit 6bb683663ad7ae9c303284c335a731a13233c6c2, which “changes” 1 the indentation of some huge dictionary file. Its import took 5 minutes and 16 seconds.

Note the log scale, chosen to avoid showing only the extremely short times, which are by far the most common.

Importing Ruby

In this case, we imported 70859 commits to reach the HEAD of the main branch. Only three commits took too long to fit meaningfully in the histograms. In chronological order:

Importing CPython

We stopped the import after 52664 commits. That repository had consistently low apply times, and a large number of high record times (913 of them, or 1.7% of the total, took more than 5s). We believe this could be explained by a different workflow that was used to produce these commits: indeed, that repository has a rather large number of things happening in parallel, resulting in frequent channel switches in our import algorithm. Since switching channels causes a number of files to be written, pijul record recomputes the diff of all these files. This was confirmed by the few commits we inspected manually, where the diffs were actually really small and very similar to the result of git diff (the only difference being the extra labels Pijul puts on conflict resolutions), yet many files were compared, only to find that they hadn’t changed.

An example of a commit that took a long time (48.7s) was fef67eefd3f91ae562c4fdd4a0051da28f7919dc, which apparently changed no file at all, causing Pijul to diff all 3706 files of the entire repository.

This also warrants further manual inspection, but doesn’t seem to penalise real-world workflows driven by humans. If this hits you, please open a discussion on the Nest describing the exact performance problem you’re facing, along with steps to reproduce.

Conclusion

We believe the data shows Pijul to be usable even on large histories. Monorepos for large corporations are another story, but raw import speed is probably not the relevant metric in that case: partial clones, commutation and multi-root repos can probably solve many scalability problems.

Of course, these benchmarks show that there is still space for optimising and parallelising our algorithms. Anyone interested in helping is welcome; a good first step is joining our Zulip.

Hosting repositories

Our hosting platform, the Nest, is now much more robust than it was a year ago, thanks to all the improvements in Libpijul, in addition to standard and nonstandard distributed computing tricks, including:

The monitoring tools have shown for some months now that replication happens consistently, with no major bugs in repository replication.

However, improving the Nest was not the priority until very recently, which explains the unpolished feel and the many bugs: indeed, many “non-core” features haven’t been given all the care they deserve, yet they are often the first thing users see or experience. Now that Libpijul is finally getting stable, this situation is ready to change. If you experience any issue, please contact us.

The Nest was initially an ambitious bet to try and grow a community around our tool, which I saw as necessary to get as much feedback as possible from actual users working on non-test projects. Dogfooding Pijul has also probably been the most useful test we could dream of. Another goal of the Nest was to allow users with widely different technical backgrounds to start experimenting with Pijul without having to set up their own server. While the Nest’s features are quite limited, especially in comparison with all the cool UIs that have been written for Git, I believe it reached its goal, as evidenced by the impressive number of productive discussions opened on Pijul itself, and by the repositories created by users.

What’s next?

Here are the next few steps for Pijul itself:

I’m also thinking about other research projects based on this new stable version of Pijul; I’ll make sure to post about them when I have some code to show.


  1. Of course, commits don’t “change” anything, they are just a version such that one possible diff with the previous version is the one shown on that GitHub page. ↩︎