I’m proud to finally announce the beta release of Pijul, after a bit more than a year of alpha. Sorry for the long post, and Happy New Year!
Pijul has come a long way since the initial alpha release, in terms of performance, stability and features. Here are the most notable achievements since the 1.0.0-alpha release in November 2020:
Malleable identifiers make signed patches the default, while allowing users to later change their personal details (email address, name, login…). The full story for synchronising these identifiers from multiple different source repos is not yet completely written. Testing and suggestions are welcome!
Merging unrelated repos and partial clones are designed to make very large projects more manageable. I’d like to work on a number of useful derivatives of that idea in 2022 (more on that in a future post).
Tags are a compressed version of a repository at some point in time. They are an efficient way to store repositories, but do not yet contribute to making writable repos smaller (but they will!). Their main use at the moment is to make it efficient to browse a particular state of a repository.
This is actually a really cool feature: since the early days of patch-based version control (for example in Darcs), there has always been a trade-off between the ease of use and intuitiveness of patches and the navigability of snapshots. The current implementation of tags in Pijul is a first step towards getting rid of that trade-off.
Releasing a beta version obviously doesn’t mean that Pijul is completely bug-free. However, there are a number of claims we can finally make, after a year of public alpha:
We can now promise that the formats won’t change, at least not for a very long time. This is because all the core features we wanted for 1.0 are implemented. Incidentally, all format changes of the last six months (required to implement new features) have been backwards-compatible.
Our storage backend, Sanakirja, has now been tested at massive scale, including with new data structures unrelated to Pijul. Even under thorough checks, it hasn’t shown any serious bug in months.
Our algorithms for generating and applying patches have been tested both at scale with automated checks (using the `pijul git` command to import Git repositories), and manually for each change that was recorded, reviewed and applied to the Pijul repository itself. Moreover, the test suite for Libpijul (where the algorithms live) includes a number of pathological cases, and the outcome of each algorithm has been inspected manually in those tests.
First reason: the underlying storage layer was hard to write. Writing any efficient storage format is by definition hard. Examples include filesystems, which must avoid corruption and protect against hardware errors. Databases are even worse, since they must additionally implement all sorts of fast queries and efficient storage.
There are mainly two reasons the basic layers of these systems were/are hard to write:
Second reason: the upper-layer algorithms need an absolutely perfect storage layer, yet are also essential to test that layer. This is a chicken-and-egg problem, since corruption in the base layer can yield arbitrary, unpredictable results anywhere, yet the base layer cannot be tested without writing and debugging these algorithms.
Third reason: alignment between maths and UX. While the mathematics at the foundation of Pijul have the potential to deliver a simpler, more intuitive UX, users rarely want to know about the reasons for different design choices, since these reasons can sometimes be quite technical. A number of iterations have been spent on aligning the design with the feedback we’ve received.
Until quite recently, performance was not our main focus: features and stability were. This obviously had an impact on performance, although the formats themselves were designed to be optimally fast and small, even if the implementation wouldn’t follow immediately.
Many benchmarks have been designed and suggested by @tankf33der; we show some of them below. The timings shown here were recorded on an Intel Xeon E3 v5, a CPU with 4 physical cores (8 with hyperthreading) running between 2.80GHz and 3.70GHz, in a 2016 laptop. RAM was never an issue. Pijul was compiled with ZStd 1.4.9, since ZStd 1.5 has significant performance regressions.
The `pijul git` command goes through several stages, and is essentially a breadth-first search of the commit DAG:
1. Load the entire history graph into main memory to find all initial commits (many smaller repos have only one initial commit).
2. Create a channel for each initial commit. Initialise a todo list with one “todo item” for each of these commit/channel pairs.
3. Then, repeat the following until the todo list is empty:
   1. Pop the most urgent commit/channel pair from the todo list, calling the commit `c` and the channel `l`.
   2. Check that all of `c`’s parents have been imported: if so, merge the channels where they were imported into `l` and move to the next step. If not, push `(c, l)` at the end of the todo list and continue the loop (i.e. move back to step 1 of the loop).
   3. If `c` has more than one parent, prune the channels where these parents were imported, since they are no longer in the todo list, but are still alive in the repository.
   4. `git reset` to `c`, `pijul channel switch` to the channel if needed, `pijul add` all the files and run `pijul record`.
   5. If `c` has a single child `d`, place `(d, l)` as the most urgent item on the list. Otherwise, `c` has `n > 1` children: fork channel `l` `n - 1` times, placing each of the children along with one fork at the top of the todo list.
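In pseudo-code, the loop above can be sketched as follows. This is a simplified model: the channel merging and pruning, and the actual `git`/`pijul` invocations, are collapsed into a comment, and all names are illustrative rather than Pijul’s real API.

```python
from collections import deque

def import_git(commits, parents, children):
    """Breadth-first import of a commit DAG.
    `commits`: iterable of commit ids; `parents`/`children`: dicts
    mapping a commit id to the ids of its parents/children."""
    imported = {}      # commit -> channel it was recorded on
    todo = deque()
    next_channel = 0

    # One channel per initial (parentless) commit.
    for c in commits:
        if not parents.get(c):
            todo.append((c, next_channel))
            next_channel += 1

    order = []         # record order, for inspection
    while todo:
        c, l = todo.popleft()                 # most urgent pair
        if c in imported:                     # merge commits get queued once per parent
            continue
        if any(p not in imported for p in parents.get(c, [])):
            todo.append((c, l))               # a parent is missing: retry later
            continue
        # (Here the real tool merges the parents' channels into `l`, prunes
        # dead channels, then runs `git reset`, `pijul add`, `pijul record`.)
        imported[c] = l
        order.append(c)
        kids = children.get(c, [])
        if len(kids) == 1:
            todo.appendleft((kids[0], l))     # single child: most urgent item
        else:
            for i, d in enumerate(kids):      # fork the channel n-1 times
                if i == 0:
                    fork = l
                else:
                    fork = next_channel
                    next_channel += 1
                todo.appendleft((d, fork))
    return order
```

On a small diamond-shaped history (one root, two branches, one merge), this visits the root first and the merge commit last, retrying the merge until both parents have been imported.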
Vim is a text editor with a history of a relatively modest size for an open source project, with 15138 commits to import in order to get to the HEAD of the main branch. Note that Pijul records all commits on all forks that were later merged onto main as independent patches, and doesn’t record trivial merges (merges with no conflicts and no modification after the merge itself).
The following two graphs show the time taken by `pijul record` and `pijul apply` in the `pijul git` algorithm described above, for the 15138 commits in the Vim repo.
Only one commit isn’t shown: commit 6bb683663ad7ae9c303284c335a731a13233c6c2, which “changes”¹ the indentation of some huge dictionary file. Its import took 5 minutes and 16 seconds.
Note the log scale, chosen to avoid showing only the extremely short times, which are by far the most common.
In this case, we imported 70859 commits to reach the HEAD of the main branch. Only three commits took too long to make sense in the histograms. In chronological order:
a388c7dd9e15d8b25705951d8906eacf76f50d7b, almost 25s to record, and almost 408s to apply. That commit does contain very large diffs, but the apply time is quite extreme and warrants investigation, especially since it is 16 times higher than the record time, which is the opposite of the usual case.
4be11cde44351c109eaf07669046a2152f151c78, almost 190 seconds to record and about 36.5s to apply. That commit also contains huge diffs (161k lines of diff, according to GitHub).
beafa477f1b48204202dfcf5f13b2b0cc216f732, just over 100 seconds to record and 8s to apply. That commit also contains large diffs (49k lines of diff, according to GitHub).
We stopped the import after 52664 commits. That repository had consistently low apply times, and a large number of high record times (913 of them, or 1.7% of the total, took more than 5s). We believe this could be explained by a different workflow used to produce these commits: indeed, that repository has a rather large number of things happening in parallel, resulting in frequent channel switches in our import algorithm. Since switching channels causes a number of files to be written, `pijul record` recomputes the diff of all these files. This was confirmed by the few commits we inspected manually, where the diffs were actually really small and very similar to the result of `git diff` (the only difference being the extra labels Pijul puts on conflict resolutions), yet many files were compared, only to find that they hadn’t changed.
An example of a commit that took a long time (48.7s) is fef67eefd3f91ae562c4fdd4a0051da28f7919dc, which apparently changed no file at all, yet caused Pijul to diff all 3706 files of the entire repository.
This also warrants further manual inspection, but doesn’t seem to penalise real-world workflows driven by humans. If this hits you, please open a discussion on the Nest describing the exact performance problem you’re facing, along with steps to reproduce.
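The wasted work in this scenario is diffing files whose content is byte-for-byte unchanged. As a minimal sketch of the kind of short-circuit that would avoid it, one could keep a per-file content-hash cache and only diff files whose hash moved (hypothetical names; this is not Pijul’s internals):

```python
import hashlib

def files_to_diff(paths, read, hash_cache):
    """Return only the paths whose content changed since the cached
    hash was taken, updating the cache in place.
    `read(path)` returns the file's bytes (stubbed out in tests)."""
    changed = []
    for p in paths:
        h = hashlib.blake2b(read(p), digest_size=16).hexdigest()
        if hash_cache.get(p) != h:
            hash_cache[p] = h
            changed.append(p)
    return changed
```

After a channel switch rewrites many files with identical content, a second call over the same paths returns an empty list, so nothing needs to be diffed.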
We believe the data shows Pijul to be usable even on large histories. Monorepos for large corporations are another story, but raw import speed is probably not the relevant metric in that case: partial clones, commutation and multi-root repos can probably solve many scalability problems.
Of course, these benchmarks show that there is still space for optimising and parallelising our algorithms. Anyone interested in helping is welcome, one good first step is joining our Zulip.
Our hosting platform, the Nest, is now much more robust than it was a year ago, thanks to all the improvements in Libpijul, in addition to standard and nonstandard distributed computing tricks, including:
a replicated database (the “standard” part), combining PostgreSQL’s improved streaming replication features with Etcd, which is meant as a configuration tool but can also be used purely for its implementation of the Raft protocol to run leader elections, plus some glue between the two that for some reason took forever to debug.
replicated repositories using the CRDT nature of Pijul. This was prompted by one of the two fires I went through in 2021 (the other one was in my apartment and had no impact on Pijul). This was particularly hard to implement, since the testing cycle involves many tests on local machines, followed by multiple careful redeployments in production.
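The leader-election part mentioned above can be illustrated with a lease held in a compare-and-swap store. The toy `KV` class below merely stands in for Etcd (it is not Etcd’s real client API); the point is the mechanism: whichever node wins the CAS holds the lease, and must renew it before it expires.

```python
import time

class KV:
    """Toy compare-and-swap store standing in for Etcd."""
    def __init__(self):
        self.data = {}  # key -> (holder, lease expiry)

    def cas(self, key, holder, ttl, now=None):
        """Take or renew the lease on `key`: succeeds if the key is
        free, expired, or already held by `holder`."""
        now = time.monotonic() if now is None else now
        cur = self.data.get(key)
        if cur is None or cur[1] <= now or cur[0] == holder:
            self.data[key] = (holder, now + ttl)
            return True
        return False

def try_lead(kv, node_id, ttl=5.0, now=None):
    """Called periodically by every node; the CAS winner leads until
    its lease expires (or until it renews the lease)."""
    return kv.cas("nest/leader", node_id, ttl, now)
```

If the leader crashes, it stops renewing, the lease times out, and some other node’s periodic `try_lead` call takes over.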
The monitoring tools have shown for some months now that the replication happens consistently, with no major bug in the repository replication.
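The convergence property behind this replication can be modelled abstractly: since patches commute, a replica’s state behaves like a grow-only set of patches, and merging two replicas is just set union. A sketch of that model (an abstract illustration, not the Nest’s actual replication code):

```python
class Replica:
    """A repository replica modelled as a grow-only set of patch hashes."""
    def __init__(self):
        self.patches = set()

    def apply(self, patch_hash):
        self.patches.add(patch_hash)

    def merge(self, other):
        # Union is commutative, associative and idempotent, so replicas
        # converge regardless of the order (or duplication) of syncs.
        self.patches |= other.patches
```

Syncing in either direction, in any order, any number of times, yields the same state on both sides, which is exactly the CRDT property.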
However, improving the Nest was not the priority until very recently, explaining the unpolished feel and the many bugs: indeed, many “non-core” features haven’t been given all the care they deserve, yet they are often the first thing users see or experience. Now that Libpijul is finally getting stable, this situation is ready to change. If you experience any issue, please contact us.
The Nest was initially an ambitious bet to try and grow a community around our tool, which I saw as necessary to get as much feedback as possible from actual users working on non-test projects. Dogfooding Pijul has also probably been the most useful test we could dream of. Another goal of the Nest was to allow users with widely different technical backgrounds to start experimenting with Pijul without having to setup their own server. While the Nest’s features are quite limited, especially in comparison with all the cool UIs that have been written for Git, I believe it reached its goal, as evidenced by the impressive number of productive discussions opened on Pijul itself, and by the repositories created by users.
Here are the next few steps for Pijul itself:
Converting Pijul repositories to Git, or at least doing something to make it easy to switch back and forth. This is significantly more involved than a Git/SVN gateway, since Git and Pijul work in fundamentally different ways, whereas SVN’s data model can be seen as a special case of Git’s (SVN has a DAG of commits that is a line graph).
Indeed, in our case, converting from Git to Pijul is easy, and so is converting from Pijul to Git. However, making these two conversions work together is tricky, since commits cannot be produced from patches independently from other patches applied to the repository.
One issue is that if Pijul fails to recognise a commit `a` as coming from a patch `b` already applied to the repository, it could import `a` a second time, creating a patch `c` with the same content as `b`, but with a different hash. The problem is that applying `c` in the same repo could create lots of unintelligible conflicts.
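One pragmatic mitigation, sketched below as an idea rather than an implemented Pijul feature, is to keep a persistent map from imported commit hashes to the patch hashes they produced, and consult it before importing again:

```python
def import_commit(commit_hash, make_patch, commit_to_patch):
    """Import a commit unless a patch for it is already known.
    `make_patch` turns a commit into a patch hash (stubbed in tests);
    `commit_to_patch` is the persistent commit -> patch map."""
    if commit_hash in commit_to_patch:
        return commit_to_patch[commit_hash]  # reuse instead of duplicating
    patch = make_patch(commit_hash)
    commit_to_patch[commit_hash] = patch
    return patch
```

This only covers the easy direction; recognising patches that were produced on the Git side from a previous Pijul-to-Git conversion is the harder part of the round trip.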
There could be nice algorithmic solutions to that problem. If you’re interested in helping, please join our Zulip.
Some users working on large binary files (video game projects, for example) have asked for file locking. Pijul already has specific features for these files to:
Since files are actually modelled in Pijul, locking is quite easy to implement. Again, anyone needing mentoring to work on that feature is welcome to join our Zulip.
Symbolic links aren’t properly handled yet, but shouldn’t be too hard to add. One reason they haven’t been implemented yet is that conflicts involving them should be handled in a non-naïve way, to keep them intuitive.
As the Nest improves, we’ll soon be able to restart our CI and offer private repositories and a more professional service and support. If you’re interested in custom setups of the Nest, please contact us.
I’m also thinking about other research projects based on this new stable version of Pijul, I’ll make sure to post about them when I have some code to show.
¹ Of course, commits don’t “change” anything: they are just a version such that one possible diff with the previous version is the one shown on that GitHub page.