sorry for the lengthy post, I have a question about common crpytographic hashing algorithms, such as the SHA family, MD5, etc.
In general, such a hash algorithm cannot be injective, since the actual digest produced has usually a fixed length (e.g., 160 bits under SHA-1, etc.) whereas the space of possible messages to be digested is virtually infinite.
However, if we generate a digest of a message which is at most as long as the digest generated, what are the properties of the commonly used hashing algorithms? Are they likely to be injective on this limited message space? Are there algorithms, which are known to produce collisions even on messages whose bit lengths is shorter than the bit length of the digest produced?
I am actually looking for an algorithm, which has this property, i.e., which, at least in principle, may generate colliding hashes even for short input messages.
Background: We have a browser plug-in, which for every web-site visited, makes a server request asking, whether the web-site belongs to one of our known partners. But of course, we don't want to spy on our users. So, in order to make it hard to generate some kind of surf history, we do not actually send the URL visited, but a hash digest (currently, SHA-1) of some cleaned up version. On the server side, we have a table of hashes of well-known URIs, which is matched against the received hash. We can live with a certain amount of uncertainty here, since we consider not being able to track our users a feature, not a bug.
For obvious reasons, this scheme is pretty fuzzy and admits false positives as well as URIs not matched which should have.
So right now, we are considering changing the fingerprint generation to something, which has more structure, for example, instead of hashing the full (cleaned up) URI, we might instead:
- split the host name into components at "." and hash those individually
- check the path into components at "." and hash those individually
Join the resulting hashes into a fingerprint value. Example: hashing "www.apple.com/de/shop" using this scheme (and using Adler 32 as hash) might yield "46989670.104268307.41353536/19857610/73204162".
However, as such a fingerprint has a lot of structure (in particular, when compared to a plain SHA-1 digest), we might accidentally make it pretty easy again to compute actual URI visited by a user (for example, by using a pre-computed table of hash values for "common" compont values, such as "www").
So right now, I am looking for a hash/digest algorithm, which has a high rate of collisions (Adler 32 is seriously considered) even on short messages, so that the probability of a given component hash being unique is low. We hope, that the additional structure we impose provides us with enough additional information to as to improve the matching behaviour (i.e., lower the rate of false positives/false negatives).