I'm building a system that needs to detect whether blobs of bytes have been updated. Rather than storing the whole blob (they can be up to 5 MB), I'm thinking I should compute a checksum of it, store that, and compute the same checksum a little later to see whether the blob has changed.
The goal is to minimize the following (in that order):
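To make the idea concrete, here is a minimal sketch of the check I have in mind. The names `checksum` and `has_changed` are hypothetical, and CRC-32 from Python's standard `zlib` module is only a placeholder, since which function to use is exactly the open question.

```python
import zlib


def checksum(blob: bytes) -> int:
    # Placeholder choice: CRC-32 from the standard library (32-bit result).
    # Any other checksum or hash could be swapped in here.
    return zlib.crc32(blob)


def has_changed(blob: bytes, stored_checksum: int) -> bool:
    # The blob is considered updated when its current checksum differs
    # from the one stored earlier. Equal checksums mean "probably unchanged".
    return checksum(blob) != stored_checksum


# Usage: store only the checksum, not the blob itself.
stored = checksum(b"some blob of bytes")
print(has_changed(b"some blob of bytes", stored))
print(has_changed(b"some blob of bytez", stored))
```

Only the small integer has to be kept between the two checks, which is the point of the approach.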
- size of the checksum
- time to compute
- likelihood of collisions (two identical checksums even though the content has been modified).
It is acceptable for our system to have a collision rate of no more than 1/1,000,000. The concern is not security but simply update/error detection, so rare collisions are OK (which is why I put it last in the list of things to minimize).
Also, we cannot modify the blobs ourselves.
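As a rough sanity check on the size requirement: if the checksum behaved like a uniformly random n-bit value (an idealizing assumption, not a property of any particular algorithm), a modified blob would keep the same checksum with probability about 2^-n per comparison. The hypothetical helper below works out the smallest n that meets a 1-in-N target under that assumption.

```python
import math


def min_checksum_bits(one_in: int) -> int:
    # Smallest n such that 2**-n <= 1/one_in, assuming an ideal checksum
    # whose outputs are uniformly distributed over n bits.
    return math.ceil(math.log2(one_in))


bits = min_checksum_bits(1_000_000)
print(bits, 2.0 ** -bits)
```

Under this assumption, the 1/1,000,000 target needs about 20 bits, so any well-distributed 32-bit checksum would have some margin. Real checksums are not perfectly uniform on every kind of change, so this is a lower bound on size rather than a guarantee.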
Of course, md5, crc or sha1 come to mind, and if I wanted a quick solution, I'd go for one of them. However, more than a quick solution, I'm looking for a comparison of the different methods, along with their pros and cons.
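For reference, this is roughly how I would measure the first two criteria (checksum size and compute time) for the three candidates named above, using only Python's standard library. It is a sketch, not a rigorous benchmark: `CANDIDATES` and `measure` are hypothetical names, "crc" is taken to mean CRC-32, and a single timed run per function is noisy.

```python
import hashlib
import os
import time
import zlib

# Each candidate maps a blob to its checksum as raw bytes.
CANDIDATES = {
    "crc32": lambda blob: zlib.crc32(blob).to_bytes(4, "big"),
    "md5": lambda blob: hashlib.md5(blob).digest(),
    "sha1": lambda blob: hashlib.sha1(blob).digest(),
}


def measure(blob: bytes) -> dict:
    # Returns {name: (checksum size in bytes, seconds for one run)}.
    results = {}
    for name, func in CANDIDATES.items():
        start = time.perf_counter()
        digest = func(blob)
        elapsed = time.perf_counter() - start
        results[name] = (len(digest), elapsed)
    return results


if __name__ == "__main__":
    # Random 5 MB blob, matching the largest size mentioned above.
    blob = os.urandom(5 * 1024 * 1024)
    for name, (size, seconds) in measure(blob).items():
        print(f"{name}: {size} bytes, {seconds * 1000:.2f} ms")
```

The sizes are fixed by the algorithms (4, 16 and 20 bytes respectively); the timings depend on the machine and the library build, so they would need to be measured on the target system.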