I gave a talk at GOTO Aarhus yesterday, where I announced a change of direction for the Nest, the hosting service we’ve been running for a few years already. Its new incarnation will be open source and serverless.
The first version of the Nest, using old versions of the formats and algorithms, started operating in 2016. The goal at the time was to make it easy to test Pijul at scale, in particular the `apply` and `unrecord` algorithms, as well as the protocol. This goal has been achieved, as we have seen fewer and fewer failures over time.
Other secondary goals included:
Start using asynchronous Rust in the Pijul binary as efficiently as possible, relying on the asynchronous libraries (Mio, Futures, Tokio…), which were very new at the time. It wasn’t clear (at least to me) how things would eventually pan out. The Thrussh crate is an example of a byproduct of this approach.
This also forced me to carry out massive code reorganisations in the Nest as things evolved, but I believe many projects were doing more or less the same thing.
See how I could use Sanakirja as a general-purpose database, possibly building a database CRDT on top of it. Sanakirja wasn’t yet the fastest key-value store around, but it already had a number of unique features, such as composable datatypes like the ones Pijul uses for branches/channels (maps from names (strings) to tuples of maps of different types).
This was never implemented, unfortunately, but ongoing research projects around Sanakirja could make this interesting in the future.
When the new repository format came out with the alpha release in November 2020, I obviously had to rewrite large parts of the Nest to account for it. Then, the OVH Strasbourg fire in March 2021 changed priorities a little and prompted the development of a better replication/backup strategy, which was completed during the two-week outage that followed the fire.
I described earlier how this architecture works. It has been working fine for about two years, except for one thing: database replication. Repository replication hasn’t caused a single problem, but database replication could never be made stable. This is partly because we use the backup servers as local caches to speed up some transactions, and possibly also because the machines these things run on are underprovisioned.
Concretely, for the three machines currently used by the Nest (one in France, one in Canada, one in Singapore), this results in switchovers as soon as the leader machine is under too heavy a load to send its heartbeats to the others on time. As a side-effect, the Postgres server on the leader fails to send its “WAL” files, occasionally causing crashes and data loss, which would not be acceptable if this service were used industrially.
While provisioning bigger machines sounds like an obvious fix, it doesn’t feel right: not enough computing power should result in delays, lost connections and downtime, not in data loss. Not to mention the Rube Goldberg machine that would result from using an orchestrator to manage our tricky replication setup.
So, we don’t just want to “fix” the Nest, we also want to move on, since its goal has been fulfilled. Our next target is to build a service that:
Today I’m pleased to announce a project ticking all these boxes. The new Nest is a collection of TypeScript programs running on Cloudflare Workers (Cloudflare’s FaaS platform), plus some WASM code that emulates interaction with an actual Pijul repository. The choice of Cloudflare is somewhat arbitrary, and we would like to make our code generic enough to run on other platforms.
One major challenge for FaaS scripts, which have no access to an actual hard drive, is to emulate a full-blown Pijul repository. Fortunately, Pijul is built on top of Sanakirja, a highly generic storage layer (in addition to being faster than other key-value stores). Sanakirja has a number of advantages for this job:
Sanakirja has multiple layers itself, including its own modular storage layer. Any type implementing the `LoadPage` and `AllocPage` Rust traits (defined in the `sanakirja-core` crate) can be used as a storage layer. This includes the simple memory-mapped file that the `sanakirja` crate uses, but different choices are possible with very little extra code.
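As a sketch of this pluggable-storage idea, here is a minimal pair of traits in the same spirit; note that the real `LoadPage` and `AllocPage` traits in `sanakirja-core` have different, richer signatures, so the names and methods below are simplified stand-ins:

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

// Simplified stand-ins for the idea behind sanakirja-core's
// `LoadPage` / `AllocPage` traits (the real signatures differ;
// this only illustrates the pluggable-storage pattern).
trait LoadPage {
    // Return the contents of page `offset`, if it exists.
    fn load_page(&self, offset: u64) -> Option<&[u8; PAGE_SIZE]>;
}

trait AllocPage: LoadPage {
    // Allocate a fresh zeroed page and return its offset.
    fn alloc_page(&mut self) -> u64;
    // Get a mutable view of an allocated page.
    fn page_mut(&mut self, offset: u64) -> Option<&mut [u8; PAGE_SIZE]>;
}

// One possible backend: pages kept in a HashMap, standing in for a
// remote key-value store instead of a memory-mapped file.
#[derive(Default)]
struct MemBackend {
    pages: HashMap<u64, Box<[u8; PAGE_SIZE]>>,
    next: u64,
}

impl LoadPage for MemBackend {
    fn load_page(&self, offset: u64) -> Option<&[u8; PAGE_SIZE]> {
        self.pages.get(&offset).map(|p| &**p)
    }
}

impl AllocPage for MemBackend {
    fn alloc_page(&mut self) -> u64 {
        let off = self.next;
        self.next += PAGE_SIZE as u64;
        self.pages.insert(off, Box::new([0u8; PAGE_SIZE]));
        off
    }
    fn page_mut(&mut self, offset: u64) -> Option<&mut [u8; PAGE_SIZE]> {
        self.pages.get_mut(&offset).map(|p| &mut **p)
    }
}

// Code written against the traits works with any backend.
fn write_magic<B: AllocPage>(b: &mut B) -> u64 {
    let off = b.alloc_page();
    b.page_mut(off).unwrap()[..4].copy_from_slice(b"Sknj");
    off
}

fn main() {
    let mut backend = MemBackend::default();
    let off = write_magic(&mut backend);
    assert_eq!(&backend.load_page(off).unwrap()[..4], b"Sknj");
    println!("page {} ok", off);
}
```

Swapping `MemBackend` for a type backed by a remote KV store (or a ZStd-seekable file) leaves `write_magic` and all the code above it unchanged, which is exactly why the backends below are cheap to add.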
The only example of a “different choice” I had in mind when working on this design was the “tag” feature in Pijul, where an entire Sanakirja database is compressed into a ZStd-seekable file, which can then be read as a normal repository, using the same code, without decompressing the entire file.
The project described in this post provides another example, using a FaaS key-value store as a storage backend for Sanakirja. The primary motivation was the ability to fork tables without copying a single byte, but the same idea now provides a way to implement different datastructures on top of a basic KV store.
A third example (not yet implemented; please contact me if you’re interested in helping) would be to use `io_uring` instead of `mmap`, to get optimal control over memory usage and concurrency.
Sanakirja has a very cheap reference-counting mechanism (costing nothing for tables that don’t use it), allowing us to clone a table at extremely low cost. This is how branches are handled in Pijul, for example, and the main reason we couldn’t use the standard KV mechanisms available in Cloudflare Workers.
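To see why reference counting makes cloning O(1), here is a toy model (not Sanakirja's actual implementation): a clone only bumps the count on the shared page, and bytes are copied lazily, on first write:

```rust
use std::collections::HashMap;

// Toy copy-on-write page store: cloning shares pages by bumping a
// reference count; a shared page is only copied when it is written.
#[derive(Default)]
struct Store {
    // page id -> (reference count, data)
    pages: HashMap<u64, (usize, Vec<u8>)>,
    next: u64,
}

impl Store {
    fn alloc(&mut self, data: Vec<u8>) -> u64 {
        let id = self.next;
        self.next += 1;
        self.pages.insert(id, (1, data));
        id
    }
    // Cloning: no bytes are copied, only a counter is incremented.
    fn clone_page(&mut self, id: u64) -> u64 {
        self.pages.get_mut(&id).unwrap().0 += 1;
        id
    }
    // Writing: copy-on-write, but only if the page is shared.
    fn write(&mut self, id: u64, byte: u8) -> u64 {
        let (rc, data) = self.pages.get_mut(&id).unwrap();
        if *rc == 1 {
            data.push(byte);
            id
        } else {
            *rc -= 1;
            let mut copy = data.clone();
            copy.push(byte);
            self.alloc(copy)
        }
    }
}

fn main() {
    let mut s = Store::default();
    let main_branch = s.alloc(vec![1, 2, 3]);
    // "Forking a branch": O(1), shares the underlying page.
    let fork = s.clone_page(main_branch);
    // Writing to the fork triggers a copy; the original is untouched.
    let fork2 = s.write(fork, 4);
    assert_ne!(fork2, main_branch);
    assert_eq!(s.pages[&main_branch].1, vec![1, 2, 3]);
    assert_eq!(s.pages[&fork2].1, vec![1, 2, 3, 4]);
    println!("fork copied on write: page {} -> page {}", fork, fork2);
}
```

A plain KV store offers no such sharing: forking a table there would mean copying every key, which is exactly what this mechanism avoids.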
Sanakirja has a simple and powerful synchronisation mechanism, ensuring that read-only transactions do not get corrupted by newer read-write ones. This turned out to be a good match for the guarantees provided by Cloudflare Workers: when storing things in Cloudflare’s native key-value store, the guarantee is that all datacenters in the world see the new version at most one minute after it was written. Our strategy is therefore to have Sanakirja write data at commit time, while preventing the reuse of memory freed by transactions committed less than one minute ago.
In Sanakirja’s design, the only problem would be a reader transaction lasting long enough for its memory to be deleted or overwritten. A reader would need to run for at least one minute (and be extremely unlucky) for that to happen, which is ruled out by the limits Cloudflare imposes on request running time (at most 30s for HTTP requests).
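The reuse rule described above can be sketched as a free list with timestamps. This is an illustrative model, not the Nest's actual code; the 60-second window corresponds to Cloudflare's propagation guarantee, and the safety argument is that no reader can outlive the window because requests are capped at roughly 30 seconds:

```rust
use std::collections::VecDeque;

// A page freed by a commit may only be reallocated once every
// datacenter is guaranteed to have seen that commit, i.e. one
// minute after commit time. Readers are bounded by the ~30s
// request limit, so no reader can still hold a page that old.
const VISIBILITY_LAG_SECS: u64 = 60;

struct FreeList {
    // (page offset, commit time in seconds) of freed pages,
    // oldest first.
    freed: VecDeque<(u64, u64)>,
}

impl FreeList {
    fn new() -> Self {
        FreeList { freed: VecDeque::new() }
    }

    fn free(&mut self, page: u64, commit_time: u64) {
        self.freed.push_back((page, commit_time));
    }

    // Reuse a page only if it was freed at least a minute ago.
    fn reuse(&mut self, now: u64) -> Option<u64> {
        match self.freed.front() {
            Some(&(_, t)) if now >= t + VISIBILITY_LAG_SECS => {
                self.freed.pop_front().map(|(p, _)| p)
            }
            _ => None,
        }
    }
}

fn main() {
    let mut fl = FreeList::new();
    fl.free(42, 1000); // page 42 freed by a commit at t = 1000s
    assert_eq!(fl.reuse(1030), None);     // 30s later: too early
    assert_eq!(fl.reuse(1060), Some(42)); // 60s later: safe to reuse
    println!("page reused after the visibility window");
}
```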
This gives a way to get around two apparent limitations (lag and time limits) to build ACID transactions on general-purpose datastructures.
Finally, the most basic datastructure in Sanakirja is the B tree (where keys and values may themselves have interesting types, e.g. B trees of B trees), since that is what Sanakirja uses to write its own allocator. However, Sanakirja can also be used to build datastructures other than B trees. I’ve already done that; it’s fun, and I’ll explain how in future blog posts. We don’t leverage this in the new Nest yet, but we will.
There is still very little documentation on this new setup, but now that we have a working prototype, we will expand the manual to cover it.
Meanwhile, here are the basics: the domain meant to interact with the Pijul CLI tool is `dot.pijul.org`. You can authenticate with your signing key by installing the latest Pijul beta (`1.0.0-beta.5`) and then adding something like the following to `.pijul/config` in your repository:
```toml
# Allows you to just use `pijul push`
default_remote = "nest"

[[remotes]]
name = "nest"
# The address of my repository, in this case pmeunier/nest
# (adjust this line to your own repository).
http = "https://dot.pijul.org/pmeunier/nest"
# This line uses your patch signing keys to authenticate with the Nest.
headers.Authorization = { shell = "pijul client https://nest.pijul.org/auth" }
```
Owing to its serverless design, this project is split into a number of packages, most of them responsible for an entire feature. All of these services will ultimately be released under the AGPL-3.0 license; I’ll release them one by one, starting with the UI today.
Also, we’re now offering pro accounts (5€/month), allowing Nest users to create private repositories with no storage limit (storage above 100MB is billed separately, at 0.01€/GB·day).
This is an entirely new design, which in particular uses Cloudflare Workers in new and different ways (building large datastructures on top of their platform). There will obviously be bugs; please be patient while we fix them.
Also, contributions are welcome: feel free to join our Zulip and Discourse.