I have a rather big dataset of N
documents with less than 1% of them being near-duplicate which I want to identify. I have many number fields, and a few text fields. I consider two documents in the data set close if...
- all but one, two or three data fields are fully identical.
- corresponding text fields of two documents are only a few edits away (that's the Levensthein distance used by ElasticSearch).
How would you approach this challenge of identifying fuzzy duplicates with ElasticSearch?
I already struggle to write a (general) ElasticSearch query for part (1), which does not explicitly use the field names. Do I really have to build a huge query of the following pattern, or is there a smarter way?
( SELECT * FROM MessyData AS T1
JOIN MessyData AS T2
WHERE T1.F1 != T1.F1 AND T1.F2 = T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
JOIN MessyData AS T2
WHERE T1.F1 = T1.F1 AND T1.F2 != T2.F2 AND T1.F3 = T2.F3 AND ... )
UNION ALL
( SELECT * FROM MessyData AS T1
JOIN MessyData AS T2
WHERE T1.F1 = T1.F1 AND T1.F2 = T2.F2 AND T1.F3 != T2.F3 AND ... )
UNION ALL
( ... )
Note: I used SQL pseudocode to show what I mean for the case where all except one fields are identical. F
stands for field, T
for table, but it would be an index in ElasticSearch.
Calculating dendrograms or using another similarity measure which compares each document which every other one gives me a computational effort of N·(N-1)
and is therefore not feasible.
The approach I am considering for the 2nd part of the issue is to probe my data set with m
test documents (where m
is much smaller than N
), add up ElasticSearch's score over all m
queries. That would give me O(m·N) as computational effort, but I still would have to sort all N
score sums, at least partially, or on the fly.
Are there algorithms other than More Like This
or Fuzzy Query
for this problem? Links to scientific papers are also appreciated!
References
- https://en.wikipedia.org/wiki/Data_deduplication just as an introduction
- https://discuss.elastic.co/t/finding-documents--almost--the-same/66089/2
- https://discuss.elastic.co/t/using-fuzzy-query-to-find-near-duplicates/39075 - a question on the forum without any answer
- https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
- https://betterexplained.com/articles/sorting-algorithms/ for order of the different standard searching algorithms