Dutch and German have words that can be combined into new words: compound words.

For example, "accountmanager" is considered one word, compounded from the words "account" and "manager". Our users will use both "accountmanager" and "account manager" in documents and queries, and expect the same results for both queries.

To be able to decompound (split) words, Solr has a dictionary filter that I have configured in the schema:

<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="../../compound-word-dictionary.txt" minWordSize="8" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>

The compound-word-dictionary.txt file holds the list of words used to decompound compound words. It contains, for example, the words "account" and "manager".

The decompounding result looks correct when the query "accountmanager" is analyzed in the Solr debugger (term text):

  • accountmanager
  • account
  • manager
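To make the filter's output concrete, here is a minimal Python sketch of dictionary-based decompounding. It is a simplified model, not Solr's actual implementation: the function name `decompound` and its parameters are hypothetical stand-ins for the minWordSize, minSubwordSize and maxSubwordSize attributes, and it assumes that only the longest dictionary match at each offset is kept.

```python
def decompound(token, dictionary, min_word_size=8,
               min_subword_size=4, max_subword_size=15):
    """Return the original token followed by the dictionary words found in it.

    Simplified model (assumption): at each start offset, only the longest
    dictionary match is kept.
    """
    tokens = [token]
    if len(token) < min_word_size:
        return tokens  # too short to be treated as a compound
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            candidate = lowered[start:start + size]
            if candidate in dictionary:
                longest = candidate  # later (longer) matches overwrite shorter ones
        if longest is not None:
            tokens.append(longest)
    return tokens


dictionary = {"account", "manager"}
print(decompound("accountmanager", dictionary))
print(decompound("accountbanana", dictionary))
```

Note that the original token is always kept, and the subwords are simply added next to it. Nothing in this step says how the resulting terms should be combined at query time.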

This result, however, is treated as an OR statement: it finds all documents that contain at least one of the terms. I want it to behave like an AND statement, so that I only get the results that have both "account" and "manager" in the document.

I tried setting defaultOperator in the schema to "AND", but this is ignored when using edismax. I then set the suggested minimum-should-match parameter to 100% (mm=100%), again without the desired result. Tweaking the attributes of the dictionary filter in the schema does not change the behavior to "AND" either.

Has anybody come across this behavior when using the dictionary compound word token filter factory, and does anyone know how to make it behave like an AND statement?
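The difference between the observed and the wanted behavior can be shown with a toy example. This is only an illustration in plain Python, assuming a hypothetical four-document collection and naive whitespace tokenization; it does not call Solr.

```python
# Hypothetical toy collection, tokenized on whitespace only.
docs = {
    1: "our account manager will call you",
    2: "please reset my account password",
    3: "the manager approved the request",
    4: "she was promoted to accountmanager",
}


def search(terms, operator):
    """Return ids of documents containing any (OR) or all (AND) of the terms."""
    match = any if operator == "OR" else all
    return sorted(
        doc_id for doc_id, text in docs.items()
        if match(term in text.split() for term in terms)
    )


# Observed: the three analyzed terms are OR'd together.
observed = search(["accountmanager", "account", "manager"], "OR")

# Wanted: the compound itself, or both of its parts together.
wanted = sorted(
    set(search(["accountmanager"], "OR"))
    | set(search(["account", "manager"], "AND"))
)

print(observed)
print(wanted)
```

In this sketch the OR version also returns documents 2 and 3, which mention only one of the two parts.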

2 Answers

4
votes

It is working as expected. The DictionaryCompoundWordTokenFilterFactory just adds the "inner words" it finds. In this case that is both "account" and "manager", but it could have been just one: if the word were "accountbanana" and "banana" is not in the dictionary, only "account" would have been added.

This serves the purpose of letting someone who searches for "manager" also find the document that contains "accountmanager".

To get the behaviour you want (I understand you are applying this on the query side), you could use a dictionary that maps accountmanager to "account manager".
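One way to picture such a mapping is as a query rewrite done before the query reaches Solr. The sketch below is an assumption about how this could be applied on the client side, not something the filter does for you: the `COMPOUNDS` table and the `rewrite_query` function are hypothetical, and the output uses standard Lucene boolean query syntax.

```python
# Hypothetical mapping from a compound word to its space-separated parts.
COMPOUNDS = {"accountmanager": "account manager"}


def rewrite_query(query):
    """Expand known compounds into 'compound OR (part AND part)' clauses."""
    clauses = []
    for term in query.lower().split():
        parts = COMPOUNDS.get(term)
        if parts is None:
            clauses.append(term)
        else:
            all_parts = " AND ".join(parts.split())
            clauses.append("(" + term + " OR (" + all_parts + "))")
    return " AND ".join(clauses)


print(rewrite_query("accountmanager"))
print(rewrite_query("senior accountmanager"))
```

The compound itself stays in the query, so documents that contain the unsplit word still match.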

4
votes

Just a heads up, as I'm looking into this myself: doing this adds a lot of noise. Because Solr 3.6 sets the position increment of each subword token to 0 in CompoundWordTokenFilterBase, documents index correctly (sort of). When querying, however, you get a giant OR query over the parts of your compound word, because AnalyzerQueryNodeProcessor only checks whether positionCount==1.

For example, a search for Castaway will query for (castaway OR cast OR away). This adds a lot of noise: the movie Castaway (which is really Cast Away) will be found, but you also get everything that contains just "Away" or just "Cast".

We have actually patched Lucene to call setPositionIncrement with 1, and added some extra code in AnalyzerQueryNodeProcessor so that it produces OR'd PhraseQueryNodes, giving you ("castaway" OR "cast away"). This is also incorrect, but it reduces the noise. Phrase queries can return weird results if you always set the position increment to 1, since (castaway0, cast1, away2) can return results for "castaway away". The positions of later terms are also off now. For a better description, see: http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
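The effect of the position increment on phrase matching can be sketched with a small position model. This is a toy illustration in Python, not Lucene code: `PARTS`, `index_positions` and `phrase_matches` are hypothetical names, the decompounding result is hard-coded, and phrases are matched exactly (no slop).

```python
# Hypothetical decompounding result for one word.
PARTS = {"castaway": ["cast", "away"]}


def index_positions(text, subword_increment):
    """Assign a position to each word and to each of its subwords."""
    indexed = []
    pos = -1
    for word in text.split():
        pos += 1  # every original word advances the position by one
        indexed.append((word, pos))
        for part in PARTS.get(word, []):
            pos += subword_increment  # 0 = stacked on the word, 1 = patched
            indexed.append((part, pos))
    return indexed


def phrase_matches(indexed, phrase):
    """True if the phrase terms occur at consecutive positions."""
    by_term = {}
    for term, pos in indexed:
        by_term.setdefault(term, set()).add(pos)
    first, *rest = phrase
    return any(
        all(start + offset in by_term.get(term, ())
            for offset, term in enumerate(rest, 1))
        for start in by_term.get(first, ())
    )


stock = index_positions("castaway island", 0)    # castaway0 cast0 away0 island1
patched = index_positions("castaway island", 1)  # castaway0 cast1 away2 island3

print(phrase_matches(stock, ["cast", "away"]), phrase_matches(patched, ["cast", "away"]))
print(phrase_matches(stock, ["castaway", "island"]), phrase_matches(patched, ["castaway", "island"]))
```

In this model an increment of 1 makes the phrase "cast away" match, but it pushes "island" from position 1 to position 3, so the phrase "castaway island" no longer matches. That is the kind of shifted position for later terms described above.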