9
votes

Given a phrase match query like this:

{
    'match_phrase': {
        'text.english': {
            'query': "The fox jumped over the wall",
            'phrase_slop': 4,
        }
    }
}

Is there a way I can group results by the exact match?

So if I have 1 document with text.english containing "The quick fox jumps over the small wall" and 3 documents containing "The lazy fox jumped over the big wall", I want to end up with those two groups of results.

I'm OK with running multiple queries and doing some processing outside of ES, but I need a solution that performs reasonably over a large set of documents. Ideally I'm hoping there's a way to do this using aggregations that I've missed.

The best solution I've come up with is to run the query above with highlights, parse out all of the highlights from all of the results, and group them based on highlight content. This is fine for very small result sets, but over a result set of 1000+ documents it is prohibitively slow.

EDIT: Maybe I can make this a bit clearer. If I have sample documents with the following values:

  1. "The quick fox jumps over the small wall. Blah blah blah many pages of unrelated text."
  2. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."
  3. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."
  4. "The lazy fox jumped over the big wall. Blah blah blah many pages of unrelated text."

I want to be able to group my results as follows with query text "The fox jumped over the wall":

  • "The quick fox jumps over the small wall" - Document 1
  • "The lazy fox jumped over the big wall" - Documents 2, 3, 4
What are you trying to achieve? From those two sample documents, can you explain what should be the desired outcome? - Andrei Stefan
Ok, so you want your query to match, but the results should be grouped by the text they matched? A simple aggregation on the text.english.raw should do it (where .raw is a not_analyzed subfield). - Andrei Stefan
Exactly, I want to group the results by the exact match text. I have both an analysed and a raw copy of each doc. How does the aggregation work though? I couldn't find one that would do that. - Cole Maclean
"The lazy fox jumped over the big wall" this is the text that was indexed initially. Do you want to group based on this text or on something else? What if your text has 5 lines, do you want to group on this entire text? - Andrei Stefan
I want to group based on the match, not the entire text. - Cole Maclean

4 Answers

2
votes

If the statements inside your text.english are exactly the same, then their scores should be the same too. You could aggregate the results based on the Elasticsearch _score.

Please refer to this SO question: "ElasticSearch: aggregation on _score field?"

Since ES has disabled dynamic scripting, this one might help as well: "ElasticSearch: aggregation on _score field w/ Groovy disabled"

2
votes

In my opinion, highlighting is the only option, because it's the only way Elasticsearch will show which "parts" of the text matched. And in your case, you want to group documents based on what "matched".

If the text were shorter (a few words, say), a more involved solution might have been to split the text in a shingle-like way and somehow group on those phrases... maybe.

But for pages of text, I think the only option is to use highlighting and perform additional steps afterwards to group the highlighted parts.
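The "additional steps afterwards" could look roughly like the sketch below. It is only an illustration: it assumes the search response has already been parsed into Python dicts, that highlighting was requested on text.english with the default `<em>`/`</em>` tags, and that the text of interest runs from the first highlighted term to the last one in each fragment. The function names `matched_span` and `group_by_match` are made up for this example.

```python
from collections import defaultdict


def matched_span(fragment, pre_tag="<em>", post_tag="</em>"):
    """Return the text between the first and last highlighted term, tags removed."""
    start = fragment.find(pre_tag)
    end = fragment.rfind(post_tag)
    if start == -1 or end < start:
        return None  # nothing was highlighted in this fragment
    inner = fragment[start:end].replace(pre_tag, "").replace(post_tag, "")
    return " ".join(inner.split())  # collapse whitespace so equal spans compare equal


def group_by_match(hits, field="text.english"):
    """Map each matched span to the ids of the documents it was found in."""
    groups = defaultdict(list)
    for hit in hits:
        for fragment in hit.get("highlight", {}).get(field, []):
            span = matched_span(fragment)
            if span is not None and hit["_id"] not in groups[span]:
                groups[span].append(hit["_id"])
    return dict(groups)
```

It would be called as `group_by_match(response["hits"]["hits"])`. Terms that the analyzer does not highlight (stop words, for instance) at the edges of the phrase will be missing from the group key, so the keys may be slightly shorter than the phrase as it appears in the document.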

0
votes

I have a similar problem/challenge in a product search application. I want to group products by brand, e.g.

Nikon
Nikos

To solve this problem I'm experimenting with the Suggester. The idea behind it is that the suggester will provide me with suggestions for my searches. The suggestions will be grouped and will not be repeated for all documents (even though there is possibly some other text around them). You can use a Term Suggester or a Phrase Suggester.

This approach, however, probably requires you to change the handling of the results. You have to display the suggestions as the groups and handle search results separately. The advantage of this approach is that you don't have to do the grouping yourself.

Another solution is to use a Terms Aggregation on shingles. This aggregation would group by word groups (shingles). To get your result, however, you have to take all the returned buckets and match them against your query input. See the example mapping, data and query:

PUT /so
{
   "settings": {
      "analysis": {
         "analyzer": {
            "suggestion_analyzer": {
               "tokenizer": "standard",
               "filter": [
                  "lowercase"
               ]
            },
            "analyzer_shingle": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "filter_shingle"
               ]
            }
         },
         "filter": {
            "filter_shingle": {
               "type": "shingle",
               "min_shingle_size": 4,
               "max_shingle_size": 16,
               "output_unigrams": false
            }
         }
      }
   },
   "mappings": {
      "d": {
         "properties": {
            "text": {
               "properties": {
                  "english": {
                     "type": "string",
                     "fields": {
                        "shingles": {
                           "type": "string",
                           "analyzer": "analyzer_shingle"
                        },
                        "suggest": {
                           "type": "completion",
                           "index_analyzer": "analyzer_shingle",
                           "search_analyzer": "analyzer_shingle",
                           "payloads": true
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

Document 1:

POST /so/d/1
{
    "text": {
        "english": "The quick fox jumps over the big wall. JJKJKJKJ"
    }
}

Document 2:

POST /so/d/2
{
    "text": {
        "english": "The quick fox jumps over the small wall. JJKJKJKJ"
    }
}

Document 3:

POST /so/d/3
{
    "text": {
        "english": "The quick fox jumps over the gugus wall. LLLLLLL"
    }
}

Query:

POST /so/_search
{
    "size": 0,
    "query": {
        "match": {
           "text.english": "The quick fox jumps over the wall"
        }
    }, 
    "aggs" : {
        "states" : {
            "terms" : {
                "field" : "text.english.shingles",
                "size": 40
            }
        }
    }
}
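
The step of matching the returned buckets against the query input is not shown above. A rough sketch of it in Python follows, assuming the buckets under `aggregations.states.buckets` have been parsed into dicts with `key` and `doc_count`. It keeps a shingle if it starts and ends with the first and last query words, contains all query words in order, and has at most `max_extra` additional words in between. The function names and the `max_extra` parameter are made up for this example, and the comparison is a plain lowercase word comparison with no stemming.

```python
def is_subsequence(needle, haystack):
    """True if every item of needle appears in haystack, in the same order."""
    remaining = iter(haystack)
    return all(token in remaining for token in needle)


def matching_shingles(buckets, query, max_extra=4):
    """Return {shingle: doc_count} for the shingles that 'contain' the query phrase."""
    query_tokens = query.lower().split()
    if not query_tokens:
        return {}
    matches = {}
    for bucket in buckets:
        tokens = bucket["key"].lower().split()
        if not tokens:
            continue
        # The shingle must begin and end where the query phrase does.
        if tokens[0] != query_tokens[0] or tokens[-1] != query_tokens[-1]:
            continue
        # Allow only a limited number of extra words inside the phrase.
        if len(tokens) - len(query_tokens) > max_extra:
            continue
        if is_subsequence(query_tokens, tokens):
            matches[bucket["key"]] = bucket["doc_count"]
    return matches
```

With the three sample documents and the query "The quick fox jumps over the wall", this would be expected to keep the shingles ending in "big wall", "small wall" and "gugus wall", provided they are among the buckets returned (the aggregation above only returns the top 40).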
-1
votes

I believe you could create a terms aggregation over a not_analyzed version of the field.

If text.raw is defined as not_analyzed, an aggregation should take the whole field value.

I have not tested it, but I found something quite similar: "ElasticSearch terms aggregation by entire field"
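
For illustration, a request along these lines could be built as in the Python sketch below. It is equally untested against a cluster and rests on assumptions: the sub-field name text.english.raw is taken from the comments on the question, the aggregation name `by_raw_text` and the helper `raw_terms_request` are made up, and the phrase query uses the `slop` parameter of `match_phrase`.

```python
import json


def raw_terms_request(query_text, slop=4, raw_field="text.english.raw", size=10):
    """Build a search body: phrase query plus a terms aggregation on the raw field."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "match_phrase": {
                "text.english": {"query": query_text, "slop": slop}
            }
        },
        "aggs": {
            "by_raw_text": {
                "terms": {"field": raw_field, "size": size}
            }
        },
    }


def raw_terms_request_json(query_text, **kwargs):
    """Same body, serialized so it can be sent with any HTTP client."""
    return json.dumps(raw_terms_request(query_text, **kwargs), indent=4)
```

Each bucket key is the entire value of the raw field, so documents are grouped together only when their whole text is identical, not just the matched phrase.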