PatternTokenizerFactory and stopwords

Question

A document field in Solr/Lucene called COLORS contains groups of words like this:

field1: blue/dark red/green field2: blue/yellow/orange [...]

I need to run a faceted search over that field to get all the colors and the count of each color. First I tried the PatternTokenizerFactory, followed by a stopword list:

<analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="/" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords"
                enablePositionIncrements="true"
        />
</analyzer>

Unfortunately the stopword list seems to be ignored: stopwords still show up in the faceted search results.

This SO question describes the same problem. Unfortunately, the solution posted there doesn't work for me: I cannot use solr.StandardTokenizerFactory, because the standard tokenizer also splits tokens on whitespace. That means "dark red" becomes "dark" and "red", which is wrong.
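To make the expected behavior concrete, here is a minimal Python sketch of what the analyzer chain is supposed to do. It is only a simulation, not Solr code: the `analyze` helper and the stopword set `{"green"}` are hypothetical, and the whitespace split at the end is a rough stand-in for the standard tokenizer, not its actual algorithm.

```python
import re
from collections import Counter


def analyze(value, stopwords, pattern="/"):
    """Split on the pattern, lowercase and trim each token, then drop stopwords."""
    tokens = (t.lower().strip() for t in re.split(pattern, value))
    return [t for t in tokens if t and t not in stopwords]


# Field values taken from the example above.
docs = ["blue/dark red/green", "blue/yellow/orange"]

# Hypothetical stopword list, chosen only for illustration.
stopwords = {"green"}

# Facet counts: how often each remaining token occurs across all documents.
facets = Counter(tok for doc in docs for tok in analyze(doc, stopwords))

# Rough approximation of a tokenizer that also splits on whitespace:
# "dark red" is broken into two separate tokens.
whitespace_split = re.split(r"[/\s]+", docs[0])

print(facets)
print(whitespace_split)
```

With the pattern split, "dark red" stays a single facet value and the stopword is gone from the counts, which is the result I am after.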

Is there any way to use the pattern tokenizer?

Thank you for any kind of help!

The Bndr · Accepted Answer · 2011-07-18T09:27:03

For your information: facets, the pattern tokenizer and stopwords will work together in Lucene/Solr 4 :-)
