On fires

Thursday, June 3, 2021
By Pierre-Étienne Meunier

Almost three months ago, a fire started in the datacenter that hosted this website, as well as nest.pijul.com. In this post, I try to reflect on what that meant for us, and the consequences it had.

About the fire

Our servers were hosted in a datacenter run by OVH, the largest cloud provider in Europe. Part of the reason for that choice was that I expected the most active users to be close to the authors initially, which turned out to be quite wrong (the most active users of the Nest are based in Latvia, New Zealand, France and the US).

The datacenter that hosted our servers was in Strasbourg, and caught fire in the early morning of March 10th, 2021. The exact cause of the OVH fire is still under investigation (as of early June, 2021). It is known to have been caused by a UPS unit which had been serviced the day before the fire, but according to OVH, we don’t really know why the fire wasn’t detected fast enough, or why it expanded so quickly.

OVH is a pretty cheap cloud/hosting provider, and is cheap in many ways: having been around for more than 20 years, first as a hosting provider and then as a “cloud” company, they have accumulated a number of different GUIs and APIs to access their services, several of which are still in operation. This leads to a rather unintuitive website. But since that company has a radically geeky spirit, all APIs are easy to access and reasonably well documented, so writing your own cloud tools is easy.

After the fire, OVH’s reaction was widely praised for its transparency, but the company has also been criticised for some parts of its alleged datacentre design (even though videos published by OVH seem to contradict some allegations made in that article). Some people (including myself) have learned from this accident that most features of datacentres, no matter the company, are trade secrets. Not that this is any surprise, but this is a question many users of “the cloud” have probably never asked. Moreover, since big insurance money is probably involved, I doubt there’s any way to know exactly what happened before the end of the enquiry. One cool part of their response involved Octave Klaba (OVH’s main shareholder) announcing that their fire safety infrastructure would be open sourced.

What we could have done better

Given that Pijul and the Nest were both advertised as experimental, and we knew that replicating repositories would need its own protocols and algorithms, we didn’t have any replication in place. We had offsite backups of the database, mirrored to the cloud using Restic on top of OpenStack, but we didn’t have backups of the repository, for reasons I explain below.

In hindsight, just setting up an fcron job calling rsync every day to sync the repositories to my machine wouldn’t have cost much, and would have allowed me to recover faster. The total disk space occupied by all repositories isn’t that big anyway.

What we have now

After the fire, the Nest was unavailable for two weeks, during which I started work on a replication protocol. I also (re)defined the backup strategy.

Replication

Replication doesn’t replace backups, and serves a different purpose: it allows a service to stay afloat even in the presence of fires and a number of other problems. You can still make mistakes and trash the database, or get hacked, in which case a backup is still essential.

Since fires and network outages do happen, there’s a whole field of computer science called distributed algorithms, dedicated to making services robust to these events. In our particular case, Pijul is essentially a distributed datastore, which makes it particularly easy to turn into a replicated system. The only thing that was missing was a network protocol layer for a server to inform others about new changes.
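To make the idea of a server informing others about new changes concrete, here is a toy in-memory sketch. It is not the Nest's actual protocol: the `Server` class, its `submit`, `announce` and `fetch` methods, the `connect` helper and the use of a SHA-256 digest as a change identifier are all assumptions made for illustration, and real servers would talk over the network rather than call each other directly.

```python
import hashlib


class Server:
    """A toy replica holding a set of changes, keyed by a content hash."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.changes = {}  # change id -> payload bytes

    def submit(self, payload):
        """A client pushes a new change to this server."""
        # Assumption: a change is identified by a hash of its content.
        change_id = hashlib.sha256(payload).hexdigest()
        self._store_and_announce(change_id, payload)
        return change_id

    def announce(self, change_id, source):
        """A peer tells us it holds change_id; fetch it if we lack it."""
        if change_id in self.changes:
            return  # already known, so the announcement stops here
        payload = source.fetch(change_id)
        self._store_and_announce(change_id, payload)

    def fetch(self, change_id):
        """Return the payload of a change this server already holds."""
        return self.changes[change_id]

    def _store_and_announce(self, change_id, payload):
        self.changes[change_id] = payload
        for peer in self.peers:
            peer.announce(change_id, source=self)


def connect(*servers):
    """Make every server a peer of every other server."""
    for server in servers:
        server.peers = [p for p in servers if p is not server]


a, b, c = Server("a"), Server("b"), Server("c")
connect(a, b, c)
first = a.submit(b"first change")
second = c.submit(b"second change")
# Every replica now holds both changes, whichever server received them.
print(set(a.changes) == set(b.changes) == set(c.changes))
```

Because each replica stores a set of changes and ignores announcements for changes it already has, the notifications are idempotent and cannot loop forever. This sketch ignores dependencies between changes, failures and retries, which a real protocol has to handle.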

This is now quite stable, and has been in production for a week. The Nest now has three servers in different geographic locations (France, Québec and Singapore), served behind a Cloudflare proxy. I’ll try to summarise a few design principles at different levels of the stack:

Backups

Before the fire, the backups were done in the same datacentre as the main server, which is a mistake and isn’t even cheaper than using a different datacentre. The database was cloned to my laptop regularly (about once a week), and backed up to that datacenter using Restic on top of OpenStack. The repositories, however, were not backed up, mostly because I thought the repository format could still change and make the backups irrelevant. I had also not thought carefully about replication (there are only so many things one can think about at the same time). These aren’t good reasons, though, since a format change definitely didn’t happen at the same time as the fire, and I could have restarted a server on the same day if I had rsync’ed the repos every day.

The new way is to rsync the repositories and the database to my laptop every day, and from there save them to the cloud with Restic. This should allow me to get back on track quickly if really bad things happen, such as a global outage of all three servers simultaneously, together with a fire in the backup server.
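The daily routine described above could look roughly like the following Python sketch, which a scheduler such as fcron would run once a day. This is not the script actually used for the Nest: the function names, the `sources` mapping, the stamp file and all paths are hypothetical. Only the `rsync -a --delete` and `restic -r <repo> backup <path>` invocations are standard usage of those tools.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def rsync_command(remote, local_dir):
    """Build the command mirroring a remote directory into local_dir."""
    # -a keeps permissions and timestamps; --delete drops local files that
    # were removed on the server, so local_dir stays an exact mirror.
    return ["rsync", "-a", "--delete", remote, str(local_dir)]


def restic_command(restic_repo, local_root):
    """Build the command snapshotting the local mirror with restic."""
    return ["restic", "-r", restic_repo, "backup", str(local_root)]


def run_backup(sources, local_root, restic_repo, stamp_file,
               runner=subprocess.run):
    """Mirror every source locally, then snapshot the mirror.

    `sources` maps a remote rsync path to a subdirectory of local_root.
    `runner` is injectable so the sequence can be tested without rsync.
    """
    local_root = Path(local_root)
    for remote, subdir in sources.items():
        target = local_root / subdir
        target.mkdir(parents=True, exist_ok=True)
        runner(rsync_command(remote, target), check=True)
    runner(restic_command(restic_repo, local_root), check=True)
    # Reached only if every command succeeded (check=True raises otherwise).
    Path(stamp_file).write_text(datetime.now(timezone.utc).isoformat())
```

A call might look like `run_backup({"nest.example:/srv/repos/": "repos"}, "/home/me/nest-mirror", "/mnt/restic-repo", "/home/me/.cache/backup-last-success")`, where every host and path is a placeholder. Restic also needs credentials for its repository, which are left out here. Writing the stamp file only after the last command succeeds gives a cheap way to see when the backup last worked.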

Since fcron jobs are easy to forget, I also have an indicator in my i3 bar showing the latest date at which a backup succeeded.
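An indicator of this kind can be a very small script. The sketch below is an assumption about how one might do it, not the author's actual setup: it supposes the backup job writes an ISO 8601 timestamp to a stamp file after each success (the path is hypothetical), and that the status bar is configured to display the standard output of a script.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def read_last_success(stamp_file):
    """Return the time of the last successful backup, or None if unknown."""
    try:
        stamp = datetime.fromisoformat(Path(stamp_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    if stamp.tzinfo is None:
        # Assumption: timestamps written without an offset are in UTC.
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def backup_status(last_success, now, max_age=timedelta(days=1, hours=12)):
    """Format the text shown in the bar, flagging backups that are too old."""
    if last_success is None:
        return "backup: never"
    label = last_success.strftime("%Y-%m-%d")
    if now - last_success > max_age:
        return f"backup: {label} (STALE)"
    return f"backup: {label}"


if __name__ == "__main__":
    # Hypothetical location of the stamp file written by the backup job.
    stamp_path = Path.home() / ".cache" / "backup-last-success"
    print(backup_status(read_last_success(stamp_path),
                        datetime.now(timezone.utc)))
```

The `max_age` default of a day and a half is arbitrary: it leaves some slack for a daily job that runs late, while still flagging a backup that has silently stopped.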

The cloud situation

One question raised by the fire was, should we stick with OVH, or change provider? Since this is not the first major outage on OVH’s infrastructure, the question isn’t obvious to answer. Outages are a risk with any provider; they just need to be adequately mitigated.

The CLOUD situation

In our case, we have a number of requirements linked to the fact that we host user-generated content, and we know our users may not want to be subjected to censorship by other countries or organisations. Since 2018, the CLOUD Act directly enables the US administration to enforce US law on data stored in any other country, whenever the hosting company is registered in the US. I wrote “directly” because there are also other, indirect ways the US could enforce its own laws on foreign companies. The EU has theoretical defences against that (for example the Blocking Statute and INSTEX), but for some reason they seem designed to be useless, and hence no company is using them.

European alternatives

A bunch of competitors to OVH are worth mentioning here, especially in Europe, since GDPR is one of the strictest data protection regulations in the world and, like the CLOUD Act in the other direction, has inspired other countries to adjust their own laws. Some of the largest cloud providers in the EU are Hetzner, Dassault Systèmes and Scaleway.

However, these companies are way too active politically for our needs, being either directly involved in censorship moves or controlled by the arms industry. Scaleway is a more complex case: their offer is very attractive, but their largest shareholder is involved in politics, to the point of buying a major press outlet. While this isn’t a no-go for most projects, it is enough to tip the balance in our specific case.

Now, this doesn’t mean that these services won’t suit your own projects, just that the Nest can’t really achieve its mission of bringing easy and sound collaboration to the world (“peaceful” and “respectful” shouldn’t even need to be mentioned here), while depending on services as close to politics as these.

Outside of Europe

@rohan is one of the most active contributors to Pijul. He’s done an amazing job on supporting data encodings other than UTF-8 in diffs without treating them as binary, and taught me about the Southern Cross Cable, a fascinating piece of Internet infrastructure. The route he used was through that cable across the Pacific ocean, then a route across the US, and finally across the Atlantic ocean and a bit of France. Now he’s using the Singapore server, cached by Cloudflare in Auckland.


  1. The realisation that naming things is a discipline of its own has probably even been one of the greatest discoveries of the 20th century. Cantor probably started that, by rebuilding foundations for mathematics (definitions were particularly fuzzy before him), Wittgenstein established a link with philosophy, blurring the distinction between mathematics and philosophy. Kuhn even established a distinction between the scientists who name things (whom he called “revolutionary”) and the others (the “normies”). Deleuze restated the role of the philosopher as a creator of concepts (which also applies outside of science), or in other words, as a professional namer. And by the way, the history of Computer Science is full of such half-philosophical, half-mathematical discoveries, where naming is almost everything: Turing machines, Communication Complexity, Yao’s principle…

  2. Patroni itself is probably fine, but tries to be extremely generic in its backends to follow all the evolutions in the ecosystem: for example, patronictl worked fine for me for Etcd 2, but I would never see any server in the cluster using Etcd 3.