2
votes

Current hash functions are designed so that the hash changes drastically even if only a very small portion of the input is changed. What I need is a hash algorithm whose output changes in proportion to changes in the input. For example, I need something similar to this:

Hash("STR1") => 1000
Hash("STR2") => 1001
Hash("STR3") => 1002

etc. I'm not good at algorithms and have never heard of such an implementation, although I'm almost sure someone must already have come up with this algorithm.

My current requirement is a large output size (512 bits maybe?) to avoid collisions.

Thanks

UPDATE

I think I should clarify my goal; I see that I did a very poor job explaining what I need. Sorry, I'm not a native English speaker or a great communicator.

So basically I need this hash algorithm for searching for similar binary files. You can think of it as an antivirus hashing algorithm: it calculates a file checksum, but unlike traditional hash functions it can still detect a malware binary even after some small modification. This is pretty much what I'm looking for.

Another aspect is avoiding collisions. Let me explain what I mean by that; it's not a conflicting goal. I want Hash("STR1") to produce 1000 and Hash("STR2") to produce 1001 or maybe 1010, it doesn't matter as long as the value is close to the previous hash. But Hash("This is a very large string or maybe even binary data" + 100 random chars) should not produce a value close to 1000. I understand it won't always work and there will be some hash/hash-range collisions, but I think I can introduce another hashing algorithm and verify both to minimize collisions.

So what do you think? Maybe there is a better way to achieve my goal, maybe I'm asking too much, I don't know. I'm not well versed in cryptography, math, or algorithms.

Thank you again for your time and effort

6
I hope you know this is very weak security-wise, but I think I may be able to find something...Laurel
Yes, it's not for security purposes, but for search :). Thanks for your effort, Laurel :)Davita
Do you need "1str", "2str", "3str" to hash close together also?brian beuning
Avoiding collisions is incompatible with your goal of preserving "closeness" of hash results. You'll have to pick one.user149341
Locality-sensitive hashing can do this, although you end up with more collisions. If your data set is known and reasonably small, you can create a perfect hash function, although that doesn't fulfill your goal of small input change resulting in small output change. A minimal perfect hash might be what you're looking for.Jim Mischel

6 Answers

2
votes

How about a simple summation? Your hash can then wrap at the desired size, and if you take this into account during hash comparisons, a small difference in inputs should yield a small difference in hashes.

However, I think "minimal collisions" and "proportional change in output" are conflicting goals.
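
Something along these lines, as a minimal sketch (the 32-bit width and the distance helper are assumptions for illustration, not a definitive implementation):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sum of all input bytes; the sum wraps modulo 2^32. */
uint32_t sum_hash(const unsigned char *data, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; ++i)
        h += data[i];
    return h;
}

/* Compare hashes modulo 2^32 so values near the wrap-around point
   still count as close. */
uint32_t hash_distance(uint32_t a, uint32_t b) {
    uint32_t d = a - b;      /* unsigned subtraction wraps */
    uint32_t e = b - a;
    return d < e ? d : e;
}

int main(void) {
    const char *s1 = "STR1", *s2 = "STR2";
    uint32_t h1 = sum_hash((const unsigned char *)s1, strlen(s1));
    uint32_t h2 = sum_hash((const unsigned char *)s2, strlen(s2));
    printf("%u %u distance=%u\n", (unsigned)h1, (unsigned)h2,
           (unsigned)hash_distance(h1, h2));
    return 0;
}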

1
votes

This is called, in other domains, perceptual hashing.

One approach to this is as follows:

  1. Get a training multiset of n-grams. (E.g. if n=2 and your training data was "This is a test" your training set would be "Th", "hi", "is", "s ", etc)
  2. Calculate the frequencies of said n-grams and sort them, descending.

Then the hash of a word is formed by the first bits of the answers to "for each n-gram in the database, is this word's frequency of said n-gram higher than the average frequency?"

Note that this can and will result in many collisions with similar words, unfortunately, unless the hash length is absurdly long.
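
A rough sketch of this idea in C (the byte-bigram features, the tiny training text, and the 64-bit hash width are assumptions for illustration, not a tuned implementation):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NGRAMS 65536        /* all possible byte bigrams (n = 2) */
#define HASH_BITS 64        /* take the first 64 "questions" as the hash */

static double train_freq[NGRAMS];  /* relative bigram frequency in training data */
static int    ranked[NGRAMS];      /* bigram ids sorted by descending training frequency */

static int cmp_rank(const void *a, const void *b) {
    double fa = train_freq[*(const int *)a];
    double fb = train_freq[*(const int *)b];
    return (fa < fb) - (fa > fb);  /* descending order */
}

/* Relative frequency of every bigram in s. */
static void bigram_freq(const char *s, double *freq) {
    size_t len = strlen(s);
    memset(freq, 0, NGRAMS * sizeof(double));
    if (len < 2) return;
    for (size_t i = 0; i + 1 < len; ++i)
        freq[((unsigned char)s[i] << 8) | (unsigned char)s[i + 1]] += 1.0 / (len - 1);
}

void train(const char *corpus) {
    bigram_freq(corpus, train_freq);
    for (int i = 0; i < NGRAMS; ++i) ranked[i] = i;
    qsort(ranked, NGRAMS, sizeof(int), cmp_rank);
}

/* Bit i answers: does the input use the i-th most common training bigram
   more often than the training data does? */
uint64_t ngram_hash(const char *s) {
    static double freq[NGRAMS];    /* static: too large for the stack */
    uint64_t h = 0;
    bigram_freq(s, freq);
    for (int i = 0; i < HASH_BITS; ++i)
        if (freq[ranked[i]] > train_freq[ranked[i]])
            h |= (uint64_t)1 << i;
    return h;
}

int main(void) {
    train("This is a test. This is only a test of the n-gram idea.");
    printf("%016llx\n", (unsigned long long)ngram_hash("This is a test"));
    printf("%016llx\n", (unsigned long long)ngram_hash("This is a text"));
    return 0;
}

Similar words share most of their bigram statistics, so their hashes differ in only a few bit positions.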

1
votes

MD5 or SHA-x is not what you want.

According to Wikipedia, the substitution cipher, for example, has no avalanche effect ("avalanche effect" is the term you mean).

In terms of hashing, you could use some kind of character sum.

For example:

#include <stdio.h>
#include <string.h>
int main(void) {
    const char *hashme = "hallo123";
    int result = 0;
    for (size_t i = 0; i < strlen(hashme); ++i)   /* sum the character values */
        result += (unsigned char)hashme[i];
    printf("%d\n", result);
    return 0;
}
0
votes

It may be geared towards kids, but the old NSA Kids' section has some really good ideas.

Of course, these algorithms are really insecure, so you cannot use this in place of REAL encryption. (But you can't use a real encryption algorithm when you just want to have fun, either.)


The number grid involves setting up a grid, then using the coordinates of each letter:

grid of letters

Further ideas:

  • Mix up the letter arrangement
  • Convert numbers to binary to obfuscate

A winding way also uses a grid. Essentially, the letters are packed in the grid left to right, in rows downwards. The output is produced by slicing vertically through the grid:

The password is an enigma
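
A minimal sketch of the winding way in C (the 5-column grid width and the example text are assumptions for illustration):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *msg = "Thepasswordisanenigma";
    size_t len = strlen(msg), cols = 5;
    /* The text is conceptually written into the grid row by row;
       reading every cols-th character slices vertically through it. */
    for (size_t c = 0; c < cols; ++c)
        for (size_t i = c; i < len; i += cols)
            putchar(msg[i]);
    putchar('\n');
    return 0;
}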

0
votes

Typically hash and encryption algorithms oriented towards cryptography will behave in the exact opposite way of what you're looking for (i.e. small changes in the input will cause large changes in the output and vice versa), so this algorithm class is a dead end.

As a quick digression on why these algorithms behave like this: of necessity, they're designed to obscure statistical relationships between the input and output to make them more difficult to crack. For example, in the English language the letter "e" is by far the most commonly-used letter; in some very weak classical ciphers you could simply find the most common letter and figure that that corresponds to "e" (e.g. - if n is the most common letter, then odds are n = e). Actually, a statistical pattern like you describe would likely make the algorithm significantly more vulnerable to chosen-plaintext, known-plaintext, man in the middle, and replay attacks.

The man in the middle and replay attacks would be made significantly easier by the fact that it would be much easier to edit the ciphertext to achieve the desired plaintext without knowing the key (especially if you have access to a couple of chosen plaintexts).

If you know that

7/19/2016 1:35 transfer $10 from account x to account y

(where the datestamp is used to defend against a replay attack) encodes to

12345678910

whereas

7/19/2016 1:40 transfer $10 from account x to account y

encodes to

12445678910

it's a pretty safe guess that

12545678910

will mean something like

7/19/2016 1:45 transfer $10 from account x to account y

Without having access to the original key, you could replay this packet on a regular basis to continue to steal money from someone's account simply by making a trivial edit. Granted, this is a fairly contrived example, but it still illustrates the basic problem.

My understanding of what you're looking for is statistical similarity between files. This might help some: https://en.wikipedia.org/wiki/Semantic_similarity

0
votes

This does indeed exist. The term is locality-sensitive hashing. A concrete implementation can be found here: https://github.com/trendmicro/tlsh . Depending on the source documents, you might want to look at digital forensics or VisualRank (from Google) for finding similar images and video. For textual data this is commonly used in anti-spam (read more here: http://spdp.di.unimi.it/papers/pdcs04.pdf). For binary files you might want to first run a disassembler and then run the algorithm on the text version - but this is just my feeling; I don't have research to back this statement, but it would be an interesting hypothesis to test.
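
To give a flavor of the idea, here is a minimal SimHash-style sketch in C (this is not the TLSH algorithm or its API; the byte-trigram features and mixing constants are assumptions for illustration). Similar inputs produce hashes that differ in only a few bits, so similarity is measured by Hamming distance rather than by numeric closeness:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mix a 3-byte feature into 64 bits (FNV-1a style). */
static uint64_t feature_hash(const unsigned char *p) {
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < 3; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

uint64_t simhash(const unsigned char *data, size_t len) {
    int weight[64] = {0};
    for (size_t i = 0; i + 2 < len; ++i) {         /* every byte trigram is a feature */
        uint64_t f = feature_hash(data + i);
        for (int b = 0; b < 64; ++b)
            weight[b] += ((f >> b) & 1) ? 1 : -1;  /* each feature votes per bit */
    }
    uint64_t h = 0;
    for (int b = 0; b < 64; ++b)
        if (weight[b] > 0) h |= (uint64_t)1 << b;  /* keep the majority vote */
    return h;
}

int hamming(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    int d = 0;
    while (x) { d += (int)(x & 1); x >>= 1; }
    return d;
}

int main(void) {
    const char *a = "This is a very large string or maybe even binary data";
    const char *b = "This is a very large string or maybe even binary stuff";
    uint64_t ha = simhash((const unsigned char *)a, strlen(a));
    uint64_t hb = simhash((const unsigned char *)b, strlen(b));
    printf("distance = %d bits\n", hamming(ha, hb));
    return 0;
}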