
I have a rather abstract question: typical hashing algorithms (both cryptographic and non-cryptographic) change their output drastically even if the input changes only slightly.

Digest::SHA1.hexdigest 'hello'
=> "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d"
Digest::SHA1.hexdigest 'hello!'
=> "8f7d88e901a5ad3a05d8cc0de93313fd76028f8c"

Are there hash algorithms that don't change the output when input changes slightly?

Ideally, such an algorithm would have a tolerance setting specifying how much input change the hash should tolerate before the output changes.

For example, with a tolerance of 70%, the strings "hello" and "hello!" should produce the same hashed output, but with a tolerance of 95% they should produce slightly different outputs.

Maybe it's not called hashing at all, but this area is an unknown unknown to me.
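What the question describes is close to locality-sensitive hashing. Below is a minimal, illustrative SimHash sketch in Ruby (not a library API; the shingle size and bit width are arbitrary choices for the example): similar inputs produce hashes that differ in only a few bits, so "tolerance" becomes a threshold on the Hamming distance rather than a parameter of the hash itself.

```ruby
require 'digest'

# Minimal 64-bit SimHash sketch. Unlike SHA-1's avalanche effect,
# similar inputs yield hashes with a small Hamming distance.
def simhash(text, bits: 64)
  v = Array.new(bits, 0)
  # Use overlapping 3-character shingles as features (arbitrary choice).
  features = text.chars.each_cons(3).map(&:join)
  features = [text] if features.empty?
  features.each do |f|
    # Take the first bits/4 hex digits of SHA-1 as a bits-wide integer.
    h = Digest::SHA1.hexdigest(f)[0, bits / 4].to_i(16)
    # Each feature votes +1/-1 on every bit position.
    bits.times { |i| v[i] += (h[i] == 1 ? 1 : -1) }
  end
  # Positive vote sums become 1-bits in the final fingerprint.
  v.each_with_index.reduce(0) { |acc, (w, i)| w > 0 ? acc | (1 << i) : acc }
end

# Number of differing bits between two fingerprints.
def hamming(a, b)
  (a ^ b).to_s(2).count('1')
end
```

With this, two nearly identical strings typically land within a few bits of each other, while unrelated strings differ in roughly half the bits, so a caller can pick a Hamming-distance cutoff to act as the tolerance.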

It appears to be called the "avalance effect". programmers.stackexchange.com/questions/273809/… – Millie Smith
I think the word you might be looking for is fuzzy hashing – halex
avalanche*. can't edit it anymore. – Millie Smith

1 Answer


You might look into document comparison algorithms; that's closer to the class of algorithm you need.

See:

Text comparison algorithm

From there you can calculate the percentage that has changed. This does require keeping the original text to compare against, rather than a small hashed value. But I don't see any way that storing only a small value, like a hash, is going to be useful to you.
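As a concrete sketch of "calculate the % that's changed": the Levenshtein edit distance between the two texts, normalized by the longer length, gives such a percentage. Ruby has no built-in Levenshtein function, so a small dynamic-programming implementation is included here for illustration.

```ruby
# Levenshtein edit distance via two-row dynamic programming.
def levenshtein(a, b)
  return b.length if a.empty?
  return a.length if b.empty?
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      curr << [prev[j] + 1,                       # deletion
               curr[j - 1] + 1,                   # insertion
               prev[j - 1] + (ca == cb ? 0 : 1)   # substitution
              ].min
    end
    prev = curr
  end
  prev[b.length]
end

# Percentage of the longer string that changed.
def percent_changed(a, b)
  100.0 * levenshtein(a, b) / [a.length, b.length].max
end
```

For the question's example, `levenshtein("hello", "hello!")` is 1, so about 17% of the longer string changed; the questioner's tolerance threshold could then be applied to that number.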