I have millions of documents stored in MongoDb, each one having 64 bit hash.

As an example:

0011101001110001001101110000101011010101101111101110110101011001 doc1
0111100111000011011011100001101010001110111100001101101100011111 doc2

and so on.

Now, given a dynamic input hash, I would like to find all documents whose hash is within a Hamming distance of 5 in an efficient way, without comparing the input against every stored hash one by one.

There are a few solutions I found:

A) pre-filter the existing result set, as suggested in "Hamming Distance / Similarity searches in a database". I haven't given this a go yet; it seems interesting to say the least, but I can't find any information on the internet about how efficient it would be.

B) use some kind of metric-space solution (this involves having another separate structure to keep things in sync, etc.)

For the purpose of this question, I'd like to narrow it down a bit further, and know if it is possible to "exploit/hack" mongodb provided geospatial indexes. (https://docs.mongodb.com/manual/core/2dsphere/)

The geospatial indexes:

A) allow you to store GeoJSON objects (point, line, polygon)

B) query efficiently all the GeoJSON objects

C) support operations such as finding GeoJSON objects within a radius of a point, as well as intersections between GeoJSON objects

If I could find a way to map these 64-bit hashes to latitude/longitude (or maybe to polygons) such that hashes that are similar (small Hamming distance) end up grouped close together, then the geospatial index might work: from this latitude/longitude point, give me all the binary strings within a radius of 5 (Hamming distance). Could that work?

The problem is that I have no idea whether any of this is even feasible.

A really old question I found: https://groups.google.com/g/mongodb-user/c/lmlcugk2dFs?pli=1

1 Answer

Hamming distance, when applied to binary data, can be treated as a graph problem.

For 2-bit values, the first bit is the x coordinate and the second is y, and the Hamming distance between any two points is the number of edges that must be traversed to move from one to the other.

For 3 bit values, the third bit is the z coordinate, and the points are the vertices of a cube.

For 4 bits, that is a tesseract, and much harder to visualize.

For 64 bits, each value would be one of the vertices on a "unit cube" in 64 dimensions.

Each point would have 64 neighbors with a hamming distance of exactly 1.
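
For reference, a minimal sketch of computing the Hamming distance between two hashes directly, assuming they are stored as 64-character binary strings as in the question (the function name is just illustrative):

function hammingDistance(a, b) {
    // Count the positions at which the two binary strings differ.
    let dist = 0;
    for (let i = 0; i < a.length; i++) {
        if (a[i] !== b[i]) dist++;
    }
    return dist;
}

// Example: hammingDistance("0011", "0110") === 2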

One possibility is to trade a few extra gigabytes of storage for some performance in finding other points within the hamming distance.

Pre-calculate the hash values of the 64 immediate neighbors, regardless of whether they exist in the data set or not, and store those in an array in the document with the original hash. This might be quite a daunting task for already existing documents, but is a bit more manageable if done during the initial insert process.
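
A minimal sketch of that pre-calculation, assuming the hashes are stored as 64-character binary strings and using the hashField / neighbors field names from the aggregation below (the collection name and the helper function are placeholders):

function neighborHashes(hash) {
    // Flip each bit in turn to produce the 64 hashes at Hamming distance exactly 1.
    const neighbors = [];
    for (let i = 0; i < hash.length; i++) {
        const flipped = hash[i] === "0" ? "1" : "0";
        neighbors.push(hash.slice(0, i) + flipped + hash.slice(i + 1));
    }
    return neighbors;
}

// During the initial insert, store the precomputed neighbors alongside the original hash:
const hash = "0011101001110001001101110000101011010101101111101110110101011001";
db.collectionName.insertOne({hashField: hash, neighbors: neighborHashes(hash)});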

You could then find all documents whose hashes are within a hamming distance of 5 using the $graphLookup aggregation stage.

If the hash is stored in a field named hashField and the hashes that are a distance of 1 are in a field named neighbors, that might look something like:

db.collectionName.aggregate([
    {$match: {<match criteria to select starting hash>}},
    {$graphLookup: {
        from: "collectionName",
        startWith: "$neighbors",
        connectFromField: "neighbors",
        connectToField: "hashField",
        as: "closehashes",
        // depth 0 already corresponds to documents at Hamming distance 1,
        // so maxDepth: 4 covers everything within a Hamming distance of 5
        maxDepth: 4,
        depthField: "distance"
    }}
])

This would benefit greatly from an index on {hashField: 1}.
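
For example, created in the shell with the same placeholder collection name:

db.collectionName.createIndex({hashField: 1})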