Almost three months ago, a fire started in the datacenter that hosted this website, as well as nest.pijul.com. In this post, I try to reflect on what that meant for us, and the consequences it had.
Our servers were hosted in a datacenter run by OVH, the largest cloud provider in Europe. Part of the reason for that choice was that I expected the most active users to be close to the authors initially, which turned out to be quite wrong (the most active users of the Nest are based in Latvia, New Zealand, France and the US).
The datacenter that hosted our servers was in Strasbourg, and caught fire in the early morning of March 10th, 2021. The exact cause of the OVH fire is still under investigation (as of early June, 2021). It is known to have been caused by a UPS unit which had been serviced the day before the fire, but according to OVH, we don’t really know why the fire wasn’t detected fast enough, or why it expanded so quickly.
OVH is a pretty cheap cloud/hosting provider, and is cheap in many ways: because they’ve been around for more than 20 years as a hosting provider, and then as a “cloud” company, they have accumulated a number of different GUIs and APIs to access their services, several of which are still in operation. This leads to a rather unintuitive website. But since that company has a radically geeky spirit, all APIs are easy to access and reasonably well documented, so writing your own cloud tools is easy.
After the fire, their reaction has been widely praised for its transparency, and has also been criticised for some parts of their alleged design (even though videos published by OVH seem to contradict some allegations made in that article). Some people (including myself) have learned from this accident that most features of datacentres, no matter the company, are trade secrets. Not that this is any surprise, but this is a question many users of “the cloud” have probably never asked. Moreover, since big insurance money is probably involved, I doubt there’s any way to know exactly what happened before the end of the enquiry. One cool part of their response involved Octave Klaba (OVH’s main shareholder) announcing that their fire safety infrastructure would be open sourced.
Given that Pijul and the Nest were both advertised as experimental, and we knew that replicating repositories would need its own protocols and algorithms, we didn’t have any replication in place. We had offsite backups of the database, mirrored to the cloud using Restic on top of OpenStack, but we didn’t have backups of the repository, for reasons I explain below.
In hindsight, just setting an fcron job calling rsync every day to sync the repositories to my machine wouldn’t have cost much, and would have allowed me to recover faster. The total disk space occupied by all repositories isn’t that big anyway.
After the fire, the Nest was unavailable for two weeks, during which I started work on a replication protocol. I also (re)defined the backup strategy.
Replication doesn’t replace backups, and serves a different purpose: it allows a service to stay afloat even in the presence of fires and a number of other problems. You can still make mistakes and trash the database, or get hacked, in which cases a backup is still essential.
Since fires and network outages do happen, there’s a whole field of computer science called distributed algorithms, dedicated to making services robust to these events. In our particular case, Pijul is essentially a distributed datastore, which makes it particularly easy to turn into a replicated system. The only thing that was missing was a network protocol layer for a server to inform others about new changes.
This is now quite stable, and has been in production for a week. The Nest now has three servers in different geographic locations (France, Québec and Singapore), served behind a Cloudflare proxy. I’ll try to summarise a few design principles at different levels of the stack:
First, the CAP theorem, which isn’t particularly surprising, and can also be proved quite easily. A young, ambitious CS student looking forward to solving hard problems might even consider it “trivial”. And it certainly is easy to prove, if stated like that as an exercise. However, solving hard problems isn’t the main goal of research in mathematics and computer science: the main goal is understanding things, and we know that naming things is at least 99% of understanding them.¹
In this case, clearly naming desirable properties of a distributed system is indeed a very large fraction of the job, and is incredibly useful at the time of actually implementing a system, since you can clearly identify the properties you’re trying to implement, and from there immediately tell which are possible to get and which are not.
More specifically, in the context of the CAP theorem, network partitions do happen, therefore we have to sacrifice either availability or consistency. This gives two major classes of algorithms: CRDTs and blockchains sacrifice consistency, whereas Raft and Paxos sacrifice availability.
Another way to try and get a reasonable solution is called eventual consistency, where a service initially sacrifices consistency, but trusts the internet to not remain partitioned for too long. This can also be seen as a weakening of the partition tolerance property. For example, in the particular case of bitcoin, consistency is restored by electing a leader periodically (every ten minutes or so, by mining), and hoping that the result of the election can be broadcast fast enough to the entire internet before another leader gets elected (meaning before the chain splits). If that fails, bitcoin users agree to use the longest chain, which causes data loss.
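To make the trade-off concrete, here is a minimal Python sketch of the simplest CRDT, a grow-only set. It only illustrates the general technique and is not Pijul’s actual data model (which also has to handle inverse operations); the GSet class and the change names are made up for the example.

```python
class GSet:
    """Grow-only set: the simplest CRDT. Elements can be added, never removed."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, item):
        # Local update, accepted immediately even during a network partition.
        return GSet(self.items | {item})

    def merge(self, other):
        # Set union is commutative, associative and idempotent,
        # so replicas converge whatever the order of the merges.
        return GSet(self.items | other.items)


# Two replicas accept different writes while partitioned...
a = GSet().add("change-1")
b = GSet().add("change-2")
diverged = a.items != b.items

# ...and converge once they can exchange their state again.
converged = a.merge(b).items == b.merge(a).items
```

Both replicas stay available during the partition, at the price of temporarily disagreeing on the contents of the set: consistency is what gets sacrificed.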
In our case, we have one “local” database server on each machine of the cluster, used to store the changes coming from the other machines. Every time a new change is received from a user, we try to send it to other members of the cluster, and keep trying until we have succeeded for two other machines: these two machines will in turn send it to two of their neighbours, which guarantees exponentially fast propagation. The main difficulty is to deal with all kinds of changes, including “meta” changes not modelled by patches (such as channel creation, deletion and renaming), and inverse operations such as unrecording a patch.
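The “exponentially fast” part can be checked with a small simulation. The Python sketch below is not the Nest’s protocol: it assumes every send succeeds (no retries), picks the two recipients at random, and uses a hypothetical simulate_gossip function that only counts rounds.

```python
import random


def simulate_gossip(n_servers, fanout=2, seed=0):
    """Count the rounds needed for a change to reach every server.

    Server 0 receives the change from a user. In each round, every server
    that holds the change pushes it to `fanout` other servers chosen at
    random; servers that already hold it simply ignore the duplicate.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_servers:
        newly_informed = set()
        for server in sorted(informed):
            others = [s for s in range(n_servers) if s != server]
            newly_informed.update(rng.sample(others, min(fanout, len(others))))
        informed |= newly_informed
        rounds += 1
    return rounds
```

With three servers, a single round is enough, since the first server pushes to both of the others. In larger clusters the number of informed servers can at most triple in each round, so the number of rounds grows roughly with the logarithm of the cluster size.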
Then, the technical details of load balancing. A naive implementation of load balancing, which you can implement at home on your laptop using HAProxy, is just a proxy in front of multiple servers. This is fine, but with the same cost and number of servers, you can build a much faster website for your users, by spreading your servers geographically. Unfortunately, there is something you can only partially do “at home”: routing the traffic to the server closest to the user. One hack you can do with most domain providers is to host your own DNS servers, set up one per zone, and use DNS anycast to serve different IP addresses to the requests made by your users to your DNS servers. There are two problems with this:
You should probably not host your own DNS servers, mainly because DNS servers need to be fast and have zero downtime, and for that reason have their own redundancy and network systems, and
There is a better way: use IP anycast, which allows multiple servers to share the same IP. This routes the traffic to the “most suitable” server with the IP, and allows the network to re-route the traffic to another server if one server stops responding for one reason or another. The reason you can’t do this “at home” is that individual users of the internet don’t usually have much control over IP routing, whereas network operators, CDNs and cloud companies are the ones organising the network, and can play all sorts of tricks with the IP addresses ICANN gives them.
With this solution, your servers still have their own individual IP addresses, and you can still SSH to the exact machine you want: only the proxy in front of them shares its IP with many others. Yet another reason to use a proxy is to cache a large fraction of your content, making it much much faster for your users.
After a few tests, including OVH’s load balancer, I decided to use Cloudflare’s load balancer and proxy, which is not only very cheap, but also has many more endpoints than all the other networks I considered. It can also act as a CDN, which allowed us to get a speedup in the Nest’s response times by a factor of 30 in the worst regions (the Nest was apparently quite annoying to use from New Zealand, for example).
The only downside is that Cloudflare’s reasonably-priced plans are only useful for HTTP traffic, and can’t proxy IP traffic, which would have been helpful for our SSH host, or even future plans to push and pull patches using QUIC (HTTP 3). We don’t have a proper solution yet: pushing things to the Nest now has to be done via ssh.pijul.com (the manual has been updated accordingly).
Finally, replicating databases is so essential to operating websites that many solutions exist. The two main open source database servers, MySQL/MariaDB and PostgreSQL, have built-in algorithms to do that. However, they need an extra layer on top of them to organise failover when one of the servers fails. That layer could consist of a leader election protocol (such as Raft) to tell which database server is the leader, and which are the followers, as well as to organise failover when the leader fails.
This is what Patroni does, for example. However, after trying to use Patroni for weeks, and seeing it fail to re-elect a leader many, many times (often in the middle of the night)², leaving my servers unusable, I decided to write my own baby replicator using only one super basic strategy, leveraging the existing streaming replication features as implemented in PostgreSQL 12 (or later), as well as a battle-tested leader election tool called Etcd.
The result is available here.
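For readers unfamiliar with this kind of leader election, the general idea is a lease: a key with a time-to-live that the leader has to keep renewing. The Python sketch below is an in-memory toy under that assumption. It is neither the Etcd API nor the replicator linked above, and the LeaseStore class, node names and timings are all hypothetical.

```python
class LeaseStore:
    """In-memory stand-in for a key with a time-to-live (not the Etcd API)."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now, ttl):
        """Grant or renew the lease if it is free, expired, or already ours."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + ttl
            return True
        return False


store = LeaseStore()
events = [
    store.try_acquire("a", now=0, ttl=10),   # "a" becomes the leader
    store.try_acquire("b", now=5, ttl=10),   # refused: the lease is still held
    store.try_acquire("a", now=8, ttl=10),   # "a" renews, lease now ends at 18
    store.try_acquire("b", now=17, ttl=10),  # refused: not expired yet
    store.try_acquire("b", now=18, ttl=10),  # "a" stopped renewing: failover
]
```

In a real deployment the store has to be replicated and consistent itself, which is exactly the job delegated to a tool like Etcd; the database followers then only need to watch who holds the lease.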
Before the fire, the backups were done in the same datacentre as the main server, which is a mistake and isn’t even cheaper than using a different datacentre. The database was cloned to my laptop regularly (about once a week), and backed up to that datacenter using Restic on top of OpenStack. The repositories, however, were not backed up, mostly because I thought the repository format could still change and make the backups irrelevant. I had also not thought carefully about replication (there are only so many things one can think about at the same time). These aren’t good reasons, though, since a format change definitely didn’t happen at the same time as the fire, and I could have restarted a server on the same day if I had rsync’ed the repos every day.
The new way is to rsync the repositories and the database to my laptop every day, and from there save it to the cloud with Restic. This should allow me to get back on track quickly if really bad things happen, such as a global outage of all three servers simultaneously, together with a fire in the backup server.
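As a rough illustration of that two-step pipeline, here is a Python sketch that builds and runs the two commands. The host name, paths and restic repository are placeholders rather than the Nest’s real configuration, and the job is assumed to be launched by something like fcron.

```python
import subprocess


def rsync_command(source, destination):
    # -a preserves permissions and timestamps; --delete mirrors removals.
    return ["rsync", "-a", "--delete", source, destination]


def restic_command(repository, path):
    # restic reads the repository password from its usual environment
    # variables or password file, which are not shown here.
    return ["restic", "-r", repository, "backup", path]


def run_backup(mirrors, backup_root, restic_repo, run=subprocess.run):
    """Mirror each (source, destination) pair, then snapshot the local copy."""
    for source, destination in mirrors:
        # check=True raises on failure, so restic never runs on a bad mirror.
        run(rsync_command(source, destination), check=True)
    run(restic_command(restic_repo, backup_root), check=True)


if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    run_backup(
        [("nest.example:/srv/repositories/", "/backups/nest/repositories/")],
        "/backups/nest",
        "/mnt/cloud/restic-repo",
    )
```

Keeping the local mirror and the restic snapshot as separate steps means a corrupted sync can still be rolled back from an earlier snapshot.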
Since fcron jobs are easy to forget, I also have an indicator in my i3 bar showing the latest date at which a backup succeeded.
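Such an indicator can be as small as a function turning the time of the last successful backup into a one-line status. The sketch below assumes the backup job records that time somewhere; the function name, the wording of the messages and the 30-hour threshold are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def backup_status(last_success, now, max_age=timedelta(hours=30)):
    """Return a short status line suitable for a status bar."""
    label = last_success.strftime("%Y-%m-%d")
    if now - last_success > max_age:
        # A daily job that has not succeeded for over 30 hours was missed.
        return f"backup STALE since {label}"
    return f"backup ok {label}"


fresh = backup_status(datetime(2021, 6, 1, 3, 0), now=datetime(2021, 6, 1, 12, 0))
stale = backup_status(datetime(2021, 6, 1, 3, 0), now=datetime(2021, 6, 3, 12, 0))
```

Here fresh is "backup ok 2021-06-01" and stale is "backup STALE since 2021-06-01"; a threshold slightly above 24 hours avoids false alarms when a daily job runs a little late.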
One question raised by the fire was, should we stick with OVH, or change provider? Since this is not the first major outage on OVH’s infrastructure, the question isn’t obvious to answer. Outages are a risk with any provider; they just need to be adequately mitigated.
In our case, we have a number of requirements linked to the fact that we host user-generated content, and we know our users may not want to be subjected to censorship by other countries or organisations. Since 2018, the CLOUD act directly enables the US administration to enforce US law onto data stored in any other country, whenever the hosting company is registered in the US. I wrote “directly” there because there are also other ways the US could enforce their own laws onto foreign companies. The EU has theoretical ways against that (called for example Blocking statute and Instex), but for some reason they seem designed to be useless, and hence no company is using them.
A bunch of competitors to OVH are worth mentioning here, especially in Europe, since GDPR is one of the strictest data protection regulations in the world, and like the CLOUD act in the other direction, has inspired other countries to adjust their own laws. Some of the largest cloud providers in the EU are Hetzner, Dassault Systèmes and Scaleway.
However, these companies are way too active politically for our needs, being either directly involved in censorship moves or controlled by the arms industry. Scaleway is a more complex case: their offer is very attractive, but their largest shareholder is involved in politics, to the point of buying a major press outlet. While this isn’t a no-go for most projects, it is enough to tip the balance in our specific case.
Now, this doesn’t mean that these services won’t suit your own projects, just that the Nest can’t really achieve its mission of bringing easy and sound collaboration to the world (“peaceful” and “respectful” shouldn’t even need to be mentioned here), while depending on services as close to politics as these.
@rohan is one of the most active contributors to Pijul. He’s done an amazing job on supporting data encodings other than UTF-8 in diffs without treating them as binary, and taught me about the Southern Cross Cable, a fascinating piece of Internet infrastructure. The route he used was through that cable across the Pacific ocean, then a route across the US, and finally across the Atlantic ocean and a bit of France. Now he’s using the Singapore server, cached by Cloudflare in Auckland.
The realisation that naming things is a discipline of its own has probably even been one of the greatest discoveries of the 20th century. Cantor probably started that, by rebuilding foundations for mathematics (definitions were particularly fuzzy before him); Wittgenstein then established a link with philosophy, blurring the distinction between mathematics and philosophy. Kuhn even established a distinction between the scientists who name things (whom he called “revolutionary”) and the others (the “normies”). Deleuze restated the role of the philosopher as a creator of concepts (which also applies outside of science), or in other words, as a professional namer. And by the way, the history of Computer Science is full of such half-philosophical, half-mathematical discoveries, where naming is almost everything: Turing machines, Communication Complexity, Yao’s principle…
Patroni itself is probably fine, but tries to be extremely generic in its backends to follow all the evolutions in the ecosystem: for example, patronictl worked fine for me for Etcd 2, but I would never see any server in the cluster using Etcd 3.