Changes on binary files

Sunday, July 11, 2021
By Pierre-Étienne Meunier

We just released Pijul 1.0.0-alpha.52, which includes a pretty cool change: short patches on binary files.

A quick recap on Pijul repositories

The Pijul repository format isn’t particularly simple, but it essentially boils down to a graph of byte chunks, representing an order relation between these chunks. In my last post, I showed how that representation could be used to represent different file encodings (UTF-8, UTF-16, Latin-1…).

File encodings posed two problems: first, telling where the newline characters are requires decoding the file first, since some encodings use \n as part of character combinations. Then, in order to present a change in text format to the user, we first need to choose one encoding, even if the repository contains files with many different encodings.

For binary files, we don’t have the second problem, since we’re never presenting binary diffs to the user. The only thing we have to deal with is therefore the problem of splitting files into “lines” or pseudo-lines, in order to compare them.

The rsync algorithm

The problem of comparing files efficiently is an essential ingredient of the rsync protocol, which works by computing file summaries.

The first step of rsync (for a single file) is to compute, on the remote server, a list of weak checksums of each 8k block of a file. For example, a file with 81920 bytes computes ten 32-bit integers, or 40 bytes. The server therefore sends these 40 bytes to the client.

The client must then find which of these blocks are still present in its (the client’s) version of the file. The difficulty is that the client may have inserted new blocks in the file, of size not divisible by 8192. This seems to make the checksum computation inefficient, since the client would have to compute, for each byte offset in the file (but the last 8191), the checksum of each window of size 8192.

The brilliant idea making this comparison practical is the use of a rolling checksum, which can be efficiently updated: the oldest bytes can be removed in constant time, and new bytes can be added also in constant time.

The high-level idea is then that the client decides where its new blocks are, and sends just these. The reality is a bit more complex, because the rolling checksum is weak, and must be complemented with a strong hash in order to reduce the collision probability to almost zero.

rsync-style patches

The trick introduced in the last few version of Libpijul, and tested and debugged more carefully in version 1.0.0-alpha.45, is to use almost the same idea to compare the old version of a file with a new one. We don’t even need the more careful hashes, since we can compare the blocks directly, with the same complexity as computing a strong hash.

Since our diff algorithm was initially written to operate on bytes, we don’t even need to change anything beyond the “pseudo-line” splitting.

The only limitation at the moment is that this only works on files small enough that both versions (old and new) fit in the computer’s memory at the same time. This is due to a number of constraints:

Anyone interested in helping with that, and in testing, is welcome to ask anything they want on our Zulip.