
I have a rather abstract question: typical hashing algorithms (both cryptographic and non-cryptographic) change their output drastically even if the input changes only slightly.

Digest::SHA1.hexdigest 'hello'
=> "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d"
Digest::SHA1.hexdigest 'hello!'
=> "8f7d88e901a5ad3a05d8cc0de93313fd76028f8c"

Are there hash algorithms that don't change the output when input changes slightly?

Ideally, such an algorithm would have a tolerance setting specifying how much input change the hash should tolerate before the output changes.

For example, with a tolerance of 70%, the strings "hello" and "hello!" should produce the same hashed output, but with a tolerance of 95% they should produce slightly different outputs.

Maybe it's not called hashing at all, but this area is an unknown unknown to me.
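What the question describes is close to locality-sensitive hashing. Below is a minimal, illustrative SimHash sketch in Ruby (not a library API; the shingle size and bit width are arbitrary choices for the example): similar inputs produce hashes that differ in only a few bits, so "tolerance" becomes a threshold on the Hamming distance rather than a parameter of the hash itself.

```ruby
require 'digest'

# Minimal 64-bit SimHash sketch. Unlike SHA-1's avalanche effect,
# similar inputs yield hashes with a small Hamming distance.
def simhash(text, bits: 64)
  v = Array.new(bits, 0)
  # Use overlapping 3-character shingles as features (arbitrary choice).
  features = text.chars.each_cons(3).map(&:join)
  features = [text] if features.empty?
  features.each do |f|
    # Take the first bits/4 hex digits of SHA-1 as a bits-wide integer.
    h = Digest::SHA1.hexdigest(f)[0, bits / 4].to_i(16)
    # Each feature votes +1/-1 on every bit position.
    bits.times { |i| v[i] += (h[i] == 1 ? 1 : -1) }
  end
  # Positive vote sums become 1-bits in the final fingerprint.
  v.each_with_index.reduce(0) { |acc, (w, i)| w > 0 ? acc | (1 << i) : acc }
end

# Number of differing bits between two fingerprints.
def hamming(a, b)
  (a ^ b).to_s(2).count('1')
end
```

With this, two nearly identical strings typically land within a few bits of each other, while unrelated strings differ in roughly half the bits, so a caller can pick a Hamming-distance cutoff to act as the tolerance.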

It appears to be called the "avalance effect". programmers.stackexchange.com/questions/273809/… – Millie Smith
I think the word you might be looking for is fuzzy hashing – halex
avalanche*. can't edit it anymore. – Millie Smith

1 Answer


You might look into document comparison algorithms; that's closer to the class of algorithm you need.

See:

Text comparison algorithm

From there you can calculate the percentage that has changed. This does require keeping the original text to compare against, rather than a small hashed value. But I don't see any way that storing only a small value, like a hash, is going to be useful to you.
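As a concrete sketch of "calculate the % that's changed": the Levenshtein edit distance between the two texts, normalized by the longer length, gives such a percentage. Ruby has no built-in Levenshtein function, so a small dynamic-programming implementation is included here for illustration.

```ruby
# Levenshtein edit distance via two-row dynamic programming.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      curr << [prev[j] + 1,                       # deletion
               curr[j - 1] + 1,                   # insertion
               prev[j - 1] + (ca == cb ? 0 : 1)   # substitution
              ].min
    end
    prev = curr
  end
  prev[b.length]
end

# Percentage of the longer string that changed.
def percent_changed(a, b)
  100.0 * levenshtein(a, b) / [a.length, b.length].max
end
```

For the question's example, `levenshtein("hello", "hello!")` is 1, so about 17% of the longer string changed; the questioner's tolerance threshold could then be applied to that number.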