1
votes

Until now I have been doing backups with rsync from my computer to an external drive. The backup data is made of tens of thousands of small files and hundreds of big ones (Maildir email messages and episodes of my favorite series). The problem with this is that if a single sector of my backup disk fails, perhaps a single message may be corrupted, which I find intolerable.

I have thought of an alternative that works as follows. There are three trees: the file tree consisting of the data I wish to backup, the backup tree containing a copy of the file tree at a given moment in time and a hash tree which contains file hashes and metadata hashes of the backup tree. A hash of the whole hash tree is also kept. Prior to a backup, the hash of the hash tree is checked. A failure here invalidates the whole backed up data. After the check succeeds, the hash tree shape is compared to the backup tree shape and the metadata hashes are verified to ensure the backup tree is metadata and shape consistent. If it is not, individual culprits can be listed. After this, the rsync backup traversal is performed. Whenever rsync updates a file, its new hash and metadata hash are computed and inserted into the hash tree. Whenever rsync deletes a file, that file is removed from the hash tree. In the end, the hash of the hash tree is computed and stored.

This process is very useful because the hashes are computed for correct data, meaning even if a file in the file tree is corrupted after it has been inserted in the hash tree, this inconsistency does not invalidate the backup (or future backups). The most important property, however, is that if an attacker corrupts the backup medium however he likes, the information that lies there will be trusted if and only if it is correct, unless the attacker has broken the hash algorithm. Also, the data sent to the backup or restored from it can be verified incrementally.

My question is: is there a reasonable implementation of such a backup scheme? My searches tell me that the only backup schemes available either do full or differential backups (tar based, for instance) or fail to provide a cryptographic correctness guarantee (rsync).

If there are no implementations of anything like that, maybe I will write one, but I would like to avoid reinventing the wheel.

4
I think this belongs on serverfault or maybe superuser. - Omnifarious
I think I've heard the phrase 'merkle tree' used with concepts similar to this? I think forensic analysis of drives might involve things like this. you can use something like hashdeep to add hashes to data. also see tahoe-lafs - fuzzyTew

4 Answers

2
votes

What you're talking about sounds a lot like Git. I think it would pretty much do what you're describing. Just implement the process of "backing up" as git commit. You can then restore to any previous version with git checkout.

It is amazingly storage efficient and extremely fast for transfering content, which would probably save you a lot of time on your backups. As a bonus, it's free, portable and already debugged!

0
votes

This sounds almost exactly identical to how the Mercurial storage system works. The 'rsync command' would be implement using Mercurial's push, which is remarkably network efficient.

0
votes

If I had to solve the problem, I'd take RAID array (to prevent corruption) of drives, which use built-in AES encryption, and then would use any backup method I am used to.

0
votes

Git-Annex is the proper solution to this problem given available tools. It is an extension to git which allows robust support for files which are arbitrarily large, synchronizes between datastores automatically, has an optional graphical user interface, tracks how many backups you have and precisely what files are stored where, and allows you to set rules for how it should manage different content. You can also customize what cryptographic hashes are used to validate the integrity of the content.

For needs of drive backups, git-annex has interoperability with bup which has more features tuned towards those looking for regular backups of entire systems.