We just released Pijul 1.0.0-alpha.52, which includes a pretty cool change: short patches on binary files.
The Pijul repository format isn’t particularly simple, but it essentially boils down to a graph of byte chunks, representing an order relation between these chunks. In my last post, I showed how that representation could be used to represent different file encodings (UTF-8, UTF-16, Latin-1…).
File encodings posed two problems. First, telling where the newlines are requires decoding the file first, since some encodings use the byte \n as part of multi-byte sequences. Second, in order to present a change in text form to the user, we first need to choose one encoding, even if the repository contains files in many different encodings.
For binary files, we don’t have the second problem, since we’re never presenting binary diffs to the user. The only thing we have to deal with is therefore the problem of splitting files into “lines” or pseudo-lines, in order to compare them.
The problem of comparing files efficiently is an essential ingredient of the rsync protocol, which works by computing file summaries.
The first step of rsync (for a single file) is to compute, on the remote server, a weak checksum of each 8k block of the file. For an 81920-byte file, this yields ten 32-bit integers, or 40 bytes, which the server sends to the client.
The client must then find which of these blocks are still present in its own version of the file. The difficulty is that the client may have inserted new data, of length not divisible by 8192, so the blocks it needs to match can start at any byte offset. This seems to make the checksum computation expensive, since the client would have to compute, for every byte offset in the file (except the last 8191), the checksum of the 8192-byte window starting there.
The brilliant idea making this comparison practical is the use of a rolling checksum, which can be efficiently updated: the oldest bytes can be removed in constant time, and new bytes can be added also in constant time.
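Concretely, rsync’s weak checksum keeps two running sums, usually called a and b. The sketch below is simplified (real rsync truncates a and b to 16 bits and combines them slightly differently), but it shows why sliding the window by one byte is constant-time:

```rust
// Rsync-style rolling checksum, simplified for illustration:
// `a` is the sum of the bytes in the window, `b` weights each byte
// by its distance from the end of the window.
struct Rolling {
    a: u32,
    b: u32,
    len: u32, // window size, constant while rolling
}

impl Rolling {
    fn new(window: &[u8]) -> Self {
        let n = window.len() as u32;
        let mut a = 0u32;
        let mut b = 0u32;
        for (i, &x) in window.iter().enumerate() {
            a = a.wrapping_add(x as u32);
            b = b.wrapping_add((n - i as u32).wrapping_mul(x as u32));
        }
        Rolling { a, b, len: n }
    }

    // Slide the window one byte to the right in O(1):
    // drop `old` (the oldest byte), append `new`.
    fn roll(&mut self, old: u8, new: u8) {
        self.a = self.a.wrapping_sub(old as u32).wrapping_add(new as u32);
        self.b = self
            .b
            .wrapping_sub(self.len.wrapping_mul(old as u32))
            .wrapping_add(self.a);
    }

    fn digest(&self) -> u32 {
        (self.b << 16) ^ self.a
    }
}

fn main() {
    let data = b"the quick brown fox jumps over the lazy dog";
    let w = 16;
    let mut r = Rolling::new(&data[..w]);
    for i in 0..data.len() - w {
        r.roll(data[i], data[i + w]);
        // the O(1) update must agree with a from-scratch computation
        assert_eq!(r.digest(), Rolling::new(&data[i + 1..i + 1 + w]).digest());
    }
}
```

The key point is in `roll`: removing the oldest byte subtracts its (known) contribution from both sums, so no other byte in the window needs to be touched.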
The high-level idea is then that the client decides where its new blocks are, and sends just these. The reality is a bit more complex, because the rolling checksum is weak, and must be complemented with a strong hash in order to reduce the collision probability to almost zero.
The trick introduced in the last few versions of Libpijul, and tested and debugged more carefully in version 1.0.0-alpha.45, is to use almost the same idea to compare the old version of a file with the new one. We don’t even need the stronger hashes, since we can compare the blocks directly, with the same complexity as computing a strong hash.
Since our diff algorithm was initially written to operate on bytes, we don’t even need to change anything beyond the “pseudo-line” splitting.
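To give an idea of what splitting a binary buffer into “pseudo-lines” with a rolling hash can look like, here is a hedged sketch using content-defined chunking: cut wherever the low bits of a hash are zero. This is the general technique, not Libpijul’s actual code; the function name and the toy hash are mine (real implementations typically use a fixed-window hash such as buzhash):

```rust
// Split `data` into pseudo-lines by cutting wherever the low bits
// (selected by `mask`) of a running hash are all zero. Resetting the
// hash at each cut means identical content after a common cut point
// produces identical chunks.
fn pseudo_lines(data: &[u8], mask: u32) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0usize;
    let mut hash = 0u32;
    for (i, &b) in data.iter().enumerate() {
        // toy polynomial hash, for illustration only
        hash = hash.wrapping_mul(31).wrapping_add(b as u32);
        if hash & mask == 0 {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing partial chunk
    }
    chunks
}

fn main() {
    let data = b"some binary-ish content: \x00\x01\x02 repeated \x00\x01\x02 bytes";
    let chunks = pseudo_lines(data, 0x0f);
    // the chunks always reassemble into the original buffer
    let rebuilt: Vec<u8> = chunks.iter().flat_map(|c| c.iter().copied()).collect();
    assert_eq!(&rebuilt[..], &data[..]);
}
```

With a mask of n low bits, the expected chunk length is about 2^n bytes, which plays the role the newline character plays for text files.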
The only limitation at the moment is that this only works on files small enough that both versions (old and new) fit in the computer’s memory at the same time. This is due to a number of constraints:
Libpijul is meant to be compiled for platforms that don’t have a full OS with memory-mapped files: Pijul isn’t tested on WASM, but all the components of Libpijul are tested with a pure in-memory backend, without touching the disk. Therefore, operating on mmapped files would require some extra feature-flagging, which hasn’t been done yet.
A slightly stronger constraint is that the comparison must be performed on byte chunks (lines, 8k blocks…). However, Pijul doesn’t store files that way: files are stored as a graph of byte chunks, which must be assembled into a contiguous block of memory before being split again. Overcoming that limitation would not necessarily be too hard: we could imagine a nice abstraction replacing &[u8] with a list of graph vertices, and implement PartialEq on that abstraction to compare chunks one by one, without having to load everything up.
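As a rough illustration of what that abstraction could look like (all names here are hypothetical, not Libpijul’s API), one could wrap an ordered list of vertex contents and implement PartialEq against a plain byte slice, comparing byte by byte without ever concatenating the vertices:

```rust
// Hypothetical stand-in for a file stored as graph vertices: each
// vertex carries a slice of the file's bytes, in file order.
struct VertexList<'a> {
    vertices: Vec<&'a [u8]>,
}

impl<'a> VertexList<'a> {
    // Iterate over the file's bytes without assembling them into
    // one contiguous allocation.
    fn iter_bytes(&self) -> impl Iterator<Item = u8> + '_ {
        self.vertices.iter().flat_map(|v| v.iter().copied())
    }
}

// Compare against a contiguous buffer, one byte at a time.
impl<'a> PartialEq<[u8]> for VertexList<'a> {
    fn eq(&self, other: &[u8]) -> bool {
        self.iter_bytes().eq(other.iter().copied())
    }
}

fn main() {
    let vl = VertexList {
        vertices: vec![&b"hel"[..], &b"lo, "[..], &b"world"[..]],
    };
    assert!(vl.eq(&b"hello, world"[..]));
}
```

The same shape would let the diff compare two vertex lists directly, so only the chunks currently being compared need to be resident in memory.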
Anyone interested in helping with that, and in testing, is welcome to ask anything they want on our Zulip.