Elasticsearch supports searching for similar documents via its "more-like-this" (MLT) query. I'm trying to understand the query better and tune it so that it finds similar documents more reliably.

While experimenting with it, I've found that a single MLT query over multiple fields returns different results from a bool query that combines several MLT queries with one field each. Samples below (truncated):

Single MLT query with multiple fields

es.search(index=INDEX_NAME, body={
    'query': {
        'more_like_this': {
            'fields': ['title', 'category_name', 'brand'],
            'like': []  # truncated in this sample
        }
    }
})

Multiple MLT queries with a single field each

es.search(index=INDEX_NAME, body={
    'query': {
        'bool': {
            'should': [
                {'more_like_this': {
                    'fields': ['title'],
                    'like': [],  # truncated in this sample
                }},
                {'more_like_this': {
                    'fields': ['category_name'],
                    'like': [],
                }},
                {'more_like_this': {
                    'fields': ['brand'],
                    'like': [],
                }},
            ]
        }
    }
})

Why does this happen?

As I understand it, a single MLT query combines the text from all the listed fields when searching through the documents. However, the text in the title, category_name, and brand fields does not overlap, so I would expect both approaches to return the same results. They do not, and the multiple MLT queries actually give better results.

I apologise if this question has no straightforward answer. I'm looking for a better understanding, from people who know Elasticsearch well, of how to improve the returned results.

If you have time, here's a previous question I posted on MLT which remains unanswered: Elasticsearch "more_like_this" query specific to fields

1 Answer

If I understand correctly, the difference comes from where score normalization happens: within each field in one case and across fields in the other. The score is normalized by factors such as the length of the field text and the number of term occurrences. If these vary widely across fields, you would not expect the two queries to return the same results.
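
One way to check this on your own data is to run both query shapes with `explain` enabled and compare the per-hit score breakdowns. The sketch below only builds the two request bodies. The helper names `single_mlt_body` and `per_field_mlt_body` are hypothetical, and it assumes the same `fields` and `like` values as in the question. `explain` is a standard search option, and `max_query_terms` is a documented MLT parameter (default 25). My assumption, which you should verify, is that this limit applies per `more_like_this` query, so the single query shares one term budget across all three fields while each clause in the bool version gets its own.

```python
from typing import Any, Dict, List


def single_mlt_body(fields: List[str], like: List[Any], **mlt_params: Any) -> Dict[str, Any]:
    """Build a search body with one more_like_this query over all fields."""
    return {
        "explain": True,  # ask Elasticsearch for a score breakdown on every hit
        "query": {
            "more_like_this": {"fields": list(fields), "like": like, **mlt_params}
        },
    }


def per_field_mlt_body(fields: List[str], like: List[Any], **mlt_params: Any) -> Dict[str, Any]:
    """Build a search body with one more_like_this clause per field under bool/should."""
    should = [
        {"more_like_this": {"fields": [field], "like": like, **mlt_params}}
        for field in fields
    ]
    return {"explain": True, "query": {"bool": {"should": should}}}


# Hypothetical usage, assuming `es`, INDEX_NAME and `like_docs` exist as in the question:
#   fields = ["title", "category_name", "brand"]
#   a = es.search(index=INDEX_NAME, body=single_mlt_body(fields, like_docs))
#   b = es.search(index=INDEX_NAME, body=per_field_mlt_body(fields, like_docs))
#   Compare hit["_explanation"] for the same document id in both responses.
```

If the explanations show different sets of terms being matched, the term selection differs between the two shapes. If the terms are the same but the weights differ, per-field scoring is the more likely cause.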