Until now I have been doing backups with rsync from my computer to an external drive. The backup data is made of tens of thousands of small files and hundreds of big ones (Maildir email messages and episodes of my favorite series). The problem with this is that if a single sector of my backup disk fails, perhaps a single message may be corrupted, which I find intolerable.
I have thought of an alternative that works as follows. There are three trees: the file tree consisting of the data I wish to backup, the backup tree containing a copy of the file tree at a given moment in time and a hash tree which contains file hashes and metadata hashes of the backup tree. A hash of the whole hash tree is also kept. Prior to a backup, the hash of the hash tree is checked. A failure here invalidates the whole backed up data. After the check succeeds, the hash tree shape is compared to the backup tree shape and the metadata hashes are verified to ensure the backup tree is metadata and shape consistent. If it is not, individual culprits can be listed. After this, the rsync backup traversal is performed. Whenever rsync updates a file, its new hash and metadata hash are computed and inserted into the hash tree. Whenever rsync deletes a file, that file is removed from the hash tree. In the end, the hash of the hash tree is computed and stored.
This process is very useful because the hashes are computed for correct data, meaning even if a file in the file tree is corrupted after it has been inserted in the hash tree, this inconsistency does not invalidate the backup (or future backups). The most important property, however, is that if an attacker corrupts the backup medium however he likes, the information that lies there will be trusted if and only if it is correct, unless the attacker has broken the hash algorithm. Also, the data sent to the backup or restored from it can be verified incrementally.
My question is: is there a reasonable implementation of such a backup scheme? My searches tell me that the only backup schemes available either do full or differential backups (tar based, for instance) or fail to provide a cryptographic correctness guarantee (rsync).
If there are no implementations of anything like that, maybe I will write one, but I would like to avoid reinventing the wheel.