Apache Solr phrase query is not aware of filters from schema.xml

Question

I am a newbie with Solr and I have a question about the query mechanism.

In my Solr schema.xml, for a particular field (say field1), I have a standard tokenizer that splits the text into words, plus a couple of filters. One of the filters is a solr.KeepWordFilterFactory filter that has an extremely short dictionary (just 10 words, say they are: red, orange, yellow, green etc). I tested the schema with the analysis menu of Solr and everything works.

That is, a document with the text "Red fox was sitting on green grass" would translate to {"red", "green"}.

However, when I submit the query field1:"red green", it fails to find such a document, as if the query were applied to the unfiltered yet tokenized source.

Can you confirm that this is what the standard query parser actually does, i.e. that the filters are applied exclusively at index time but not to the actual search? (I understand that the search will be applied only to those documents where the index matches the analyzed query.) If not, how does the phrase query actually work in the above example?

The analysis function in the Solr admin is your friend. Select the field you wish to analyze and enter the text. It will show you exactly which terms of the text will be indexed once it has passed through each filter. — everreadyeddy
Yeah, that's correct behavior: your query asks for an exact match for "red green", but after analysis your field1 will contain 2 values, and that's not the same as searching for field1:red AND field1:green — Mysterion
The query "red green"~10 yields one match (the whole sentence about the fox). "red green" yields nothing — hellmean
To Mysterion: if I had a sentence "red green fox is sitting on bench", the "red green" query would give me a match, no problem, even though after analysis, as you say, field1 contains two values — hellmean

omu_negru · Accepted Answer · 2014-10-13T06:55:46

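When you run a query like "red green", Lucene expects to find these terms in consecutive positions, so pos(green) = pos(red) + 1. When you write it as "red green"~10, you give it 10 moves to shuffle the terms around and try to make them consecutive (this is called phrase slop). In your document the two terms are not adjacent: the filter removes the words in between, but "green" apparently keeps its original position, which is why only the sloppy query matches.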
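To make the position argument concrete, here is a minimal Python sketch of position-based phrase matching. It is a toy model, not Lucene's code: the whitespace tokenizer, the `KEEP_WORDS` set, and the `analyze` and `phrase_matches` helpers are all hypothetical, and the slop rule is a simplification of what Lucene does. It also assumes the keep-word filter leaves position gaps where it drops tokens, which is consistent with the behavior described in the question but may depend on your Lucene version and configuration.

```python
from itertools import product

# Hypothetical keep-word dictionary, standing in for the one in schema.xml.
KEEP_WORDS = {"red", "orange", "yellow", "green"}


def analyze(text):
    """Lowercase, split on whitespace, and keep only KEEP_WORDS.

    Assumption: dropped tokens still consume a position, so the
    surviving terms keep their original positions (gaps are preserved).
    Returns a list of (term, position) pairs.
    """
    tokens = text.lower().split()
    return [(tok, pos) for pos, tok in enumerate(tokens) if tok in KEEP_WORDS]


def phrase_matches(doc_text, phrase, slop=0):
    """Simplified phrase match: both sides go through the same analysis.

    Each document position is shifted by the term's position in the query.
    For an exact phrase all shifted positions coincide; the slop is the
    allowed spread between them.
    """
    doc_positions = {}
    for term, pos in analyze(doc_text):
        doc_positions.setdefault(term, []).append(pos)

    query = analyze(phrase)
    if not query or any(term not in doc_positions for term, _ in query):
        return False

    # One list of shifted candidate positions per query term.
    candidates = [
        [pos - query_pos for pos in doc_positions[term]]
        for term, query_pos in query
    ]
    return any(
        max(combo) - min(combo) <= slop for combo in product(*candidates)
    )


fox = "Red fox was sitting on green grass"
print(analyze(fox))                                # red at 0, green at 5
print(phrase_matches(fox, "red green"))            # positions are not adjacent
print(phrase_matches(fox, "red green", slop=10))   # slop bridges the gap
print(phrase_matches("red green fox is sitting on bench", "red green"))
```

In this model the query is analyzed with the same filter as the document, so the mismatch comes from the gap between positions 0 and 5, not from the filter being skipped at query time.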

Other than that, what a KeywordMarkerFilter does is mark tokens with the keyword flag. Filters that follow it can check whether a token is a keyword before modifying it. It does not stop Lucene from indexing tokens that are not marked as keywords, but it can stop later filters from modifying the marked ones.
