0
votes

Currently I work on an application where I have large number of hash values (strings).
When a query hash value (string) is given, the search process goes through those strings and return strings where the Hamming Distance between the query string and the result string is less than a given threshold.

  • Hash values are not binary strings. e.g. "1000302014771944008"
  • All hash values (strings) has the same fixed length.
  • Threshold values is not small (normally t>25) and can be vary.

I want to implement this search process using an efficient algorithm rather than using brute-force approach.
I have read some research papers (like this & this), but they are for binary strings or for low threshold values. I also tried Locality-sensitive hashing, but implementations I found were focused on binary strings.

Are there any algorithms or data structures to address this problem?
Any suggestions are also welcome. Thank you in advance.

.

Additional Information

Hamming Distance between non-binary strings

string 1: 0014479902266110001131133
string 2: 0014409902226110001111133
          -------------------------
               1     1        1    = 3 <-- hamming distance

Considered brute-force approach

  1. calculate Hamming Distance between first hash string and the query hash string.
  2. if Hamming Distance is less than the threshold, then add the hash string to the results list.
  3. repeat step 1 and 2 for all hash strings.
2
On average, what proportion of the data is returned by a query? That is, how long are your hash strings and how big are your thresholds, relatively speaking?Timothy Shields
Missing from this question are 1) the precise definition of Hamming Distance as applied to non-binary strings 2) a description or example of the brute-force approach that you've already considered.user3386109
@Timothy Shields There is no limit on number of results from a query. Roughly, length of a hash string is 250 and threshold between 25-75Sajith Janaprasad
@user3386109 I added more information as you requestedSajith Janaprasad

2 Answers

1
votes

Read the 7th section of the paper:

"HmSearch: An Efficient Hamming Distance Query Processing Algorithm".

The state-of-art result for d-query problem can be found at:

"Dictionary matching and indexing with errors and don’t care", which solves d-query problem in time O(m+log(nm)^d+occ) using space O(n*log(nm)^d), where occ is the number of query result.

If threshold values is not smal, there are practical solutions for binary strings, found on HmSearch.

I think it is possible to apply the same practical solutions found on HmSearch for arbitrary strings, but I've never seen those solutions.