I have a document that contains the following data:

Hello World and 

bmw Master World

Hello

So the document contains three lines, as shown above, and I have indexed it on my Elasticsearch server. I am using the match_phrase query below to search for the exact phrase "World Hello".

:query=>{ :match_phrase=>{ :text=> "World Hello" } }

Surprisingly, it returns the above document.

Note that this document does not contain the phrase "World Hello". However, the second line ends with "World" and the third line starts with "Hello". Is that why the document matches the query?

So it's a three-line document? Could you show your mappings? I'm fairly sure it's because, as far as the index is concerned, it's just one long line of words separated by separators. - Mysterion

1 Answer

You're probably going to want to read up a little about how analysis works.

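Also take a look at the documentation on phrase matching. With the default slop of 0, the terms must appear in the same order as in your query and at consecutive positions. Those positions are assigned to the tokens produced by analysis, and the analyzer treats a newline like any other whitespace. So the "world" that ends your second line and the "hello" on your third line sit at adjacent positions, and the document matches your query.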

Also note that the standard analyzer is used here, both in indexing the document and in analyzing the query, since no other analyzers were specified. You can customize this behavior if you wish.
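To make the position logic concrete, here is a minimal Python sketch. It is not Elasticsearch code: `analyze` is a hypothetical, simplified stand-in for the standard analyzer (lowercase, split on non-word characters), and `phrase_matches` is a hypothetical helper that only mimics a phrase match with slop 0.

```python
import re

def analyze(text):
    # Simplified stand-in for the standard analyzer (an assumption, not the
    # real implementation): split on non-word characters and lowercase.
    # Newlines are just separators, so they leave no gap between positions.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

def phrase_matches(doc_text, phrase):
    # Mimics a phrase match with slop 0: the query tokens must occur in the
    # document at consecutive positions and in the same order.
    doc = analyze(doc_text)
    query = analyze(phrase)
    n = len(query)
    return n > 0 and any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

doc1 = "Hello World and \n\nbmw Master World\n\nHello"
doc2 = "Hello World and \n\nbmw Master World"

# Position of each token in the first document.
print(list(enumerate(analyze(doc1))))

print(phrase_matches(doc1, "World Hello"))  # True: "world" at 5, "hello" at 6
print(phrase_matches(doc2, "World Hello"))  # False: nothing follows the last "world"
```

The same function is applied to both the document and the query, which mirrors the point above about one analyzer being used on both sides.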

As a quick example, I created a trivial index:

PUT /test_index

then indexed your document (with newlines escaped):

PUT /test_index/doc/1
{
    "doc_text": "Hello World and \n\nbmw Master World\n\nHello"
}

then indexed another one with the last "Hello" removed:

PUT /test_index/doc/2
{
    "doc_text": "Hello World and \n\nbmw Master World"
}

Now if I run your query, only the first document is returned:

POST /test_index/_search
{
   "query": {
      "match_phrase": {
         "doc_text": "World Hello"
      }
   }
}  
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.4459011,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.4459011,
            "_source": {
               "doc_text": "Hello World and \n\nbmw Master World\n\nHello"
            }
         }
      ]
   }
}

You can prove to yourself why this happens using term vectors. I won't go into it here, but here's some code you can use to investigate if you want to:

http://sense.qbox.io/gist/3ee955b8389d1b36ea56788654955c519e2bb429
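
If you just want a rough picture of what the term vectors will tell you, the following Python sketch builds a term-to-positions map for your document. It is only an approximation under stated assumptions: `term_positions` is a hypothetical helper, the tokenization is a simplified stand-in for the standard analyzer, and the real term vectors output carries more detail (such as offsets and frequencies).

```python
import re
from collections import defaultdict

def term_positions(text):
    # Map each lowercased token to the list of positions where it occurs.
    # The regex split is an assumption standing in for the standard analyzer.
    positions = defaultdict(list)
    for pos, tok in enumerate(re.findall(r"\w+", text)):
        positions[tok.lower()].append(pos)
    return dict(positions)

vectors = term_positions("Hello World and \n\nbmw Master World\n\nHello")

for term in sorted(vectors):
    print(term, vectors[term])

# A "world" position immediately followed by a "hello" position is what
# lets the phrase "World Hello" match.
adjacent = [p for p in vectors["world"] if p + 1 in vectors["hello"]]
print(adjacent)
```

In this sketch "world" appears at positions 1 and 5 and "hello" at 0 and 6, so the pair at 5 and 6 is the one that satisfies the phrase.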