I have millions of documents stored in MongoDB, each one having a 64-bit hash.
As an example:
0011101001110001001101110000101011010101101111101110110101011001 doc1
0111100111000011011011100001101010001110111100001101101100011111 doc2
and so on.
Now, given a dynamic input hash, I would like to find all the documents whose hash is within Hamming distance <= 5 of it, in an efficient way, without comparing the input against every stored document one by one.
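To make the goal concrete, this is the pairwise check I am trying to avoid running against every document (a small Python sketch; the hashes are the two examples above):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two 64-bit hashes: popcount of their XOR."""
    return bin(a ^ b).count("1")

doc1 = 0b0011101001110001001101110000101011010101101111101110110101011001
doc2 = 0b0111100111000011011011100001101010001110111100001101101100011111

# I want every stored document whose hash satisfies
# hamming(query_hash, stored_hash) <= 5, ideally without
# evaluating this against all millions of documents.
print(hamming(doc1, doc2))
```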
There are a few solutions I found:
A) pre-filter the existing result set ("Hamming Distance / Similarity searches in a database"). I haven't given this a go yet; it seems interesting to say the least, but I can't find any information on the internet about how efficient it would be (a rough sketch of how I read it is below this list)
B) use some kind of metric-space solution (this involves having another, separate structure and keeping it in sync, etc.)
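For reference, this is roughly how I understand approach A could look with plain MongoDB indexes, assuming the pre-filter is the usual pigeonhole-chunking trick (the database, collection and field names here are made up): with at most 5 differing bits and 6 fixed chunks, at least one chunk must match exactly, so an indexed $or over the chunks yields a candidate set that still has to be verified with the exact Hamming distance.

```python
from pymongo import MongoClient, ASCENDING

coll = MongoClient()["mydb"]["docs"]   # hypothetical database/collection names

# 6 fixed chunks covering all 64 bits; 6 chunks > 5 differing bits, so by the
# pigeonhole principle at least one chunk of any match is identical to the query's.
CHUNK_SIZES = [11, 11, 11, 11, 10, 10]

def chunk_fields(h: int) -> dict:
    """Split the 64-bit hash into fixed-position chunk fields c0..c5."""
    fields, shift = {}, 0
    for i, size in enumerate(CHUNK_SIZES):
        fields[f"c{i}"] = (h >> shift) & ((1 << size) - 1)
        shift += size
    return fields

def store(doc_id: str, h: int) -> None:
    # keep the full hash as a bit string (a raw unsigned 64-bit value can overflow BSON int64)
    coll.insert_one({"_id": doc_id, "hash": format(h, "064b"), **chunk_fields(h)})

def candidates(h: int):
    """Indexed pre-filter: every document sharing at least one chunk with h.
    Each result must still be checked with the exact hamming() test."""
    return coll.find({"$or": [{k: v} for k, v in chunk_fields(h).items()]})

for i in range(len(CHUNK_SIZES)):
    coll.create_index([(f"c{i}", ASCENDING)])
```

The obvious worry is how selective those chunk lookups stay once there are millions of documents, which is part of why the geospatial idea below is tempting.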
For the purpose of this question, I'd like to narrow it down a bit further and ask whether it is possible to "exploit/hack" MongoDB's geospatial indexes. (https://docs.mongodb.com/manual/core/2dsphere/)
The geospatial indexes:
A) allow you to store GeoJSON objects (point, line, polygon)
B) let you query all the GeoJSON objects efficiently
C) support operations such as finding GeoJSON objects within a given radius of a point, as well as intersections between GeoJSON objects
If I could find a way to map these 64-bit hashes to latitude/longitude pairs (or maybe to polygons) such that hashes with a small Hamming distance end up close to each other, then the geospatial index might work: from this latitude/longitude point, give me all the binary strings within a radius of 5 (Hamming distance). Could that work? Roughly, the query I have in mind looks like the sketch below.
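To be explicit about what I mean, here is the query shape I am hoping for. The hash_to_point mapping below is exactly the part I don't know how to build: a naive split of the hash into two halves, as sketched, does NOT make spherical distance track Hamming distance; it is only a placeholder showing where such a mapping would plug in.

```python
from pymongo import MongoClient, GEOSPHERE

coll = MongoClient()["mydb"]["docs"]          # hypothetical names again
coll.create_index([("loc", GEOSPHERE)])       # the 2dsphere index

def hash_to_point(h: int) -> list[float]:
    """Placeholder mapping from a 64-bit hash to [lng, lat].
    A naive split like this does NOT preserve Hamming distance;
    finding a mapping that does is the whole question."""
    hi, lo = h >> 32, h & 0xFFFFFFFF
    lng = lo / 0xFFFFFFFF * 360.0 - 180.0
    lat = hi / 0xFFFFFFFF * 180.0 - 90.0
    return [lng, lat]

def store(doc_id: str, h: int) -> None:
    coll.insert_one({
        "_id": doc_id,
        "hash": format(h, "064b"),
        "loc": {"type": "Point", "coordinates": hash_to_point(h)},
    })

def near(h: int, radius_metres: float):
    """What I would like: a radius that somehow corresponds to Hamming distance <= 5."""
    return coll.find({
        "loc": {"$near": {
            "$geometry": {"type": "Point", "coordinates": hash_to_point(h)},
            "$maxDistance": radius_metres,    # metres on the sphere, not a bit count
        }}
    })
```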
The problem is I have no idea whether any of this is even feasible.
A really old question I found: https://groups.google.com/g/mongodb-user/c/lmlcugk2dFs?pli=1