
I'm using Lucene.Net and the Snowball analyzer in an ASP.NET application.

In the language I'm working with, I have the following issue: two specific words with different meanings stem to the same result, so a search for either of them returns results for both.

How can I teach the analyzer either not to stem these two words or, while still stemming them, to know that they have different meanings?


2 Answers


I am working from memory here, but as I recall, one of the constructors lets you pass an array of stop words, which will stop the passed-in words from being stemmed.


With Lucene 4.0, EnglishAnalyzer now has this ability, since it has a constructor which takes a stemExclusionSet.

Of course, Lucene.Net isn't up to Lucene 4 yet, so fat lot of good that does.

However, EnglishAnalyzer does this by using a KeywordMarkerFilter. So you can create your own Analyzer, overriding the tokenStream method, and adding into the chain a KeywordMarkerFilter just before the SnowballFilter.

Something like:

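public TokenStream tokenStream(String fieldName, Reader reader) {
    // stopSet, stemExclusionSet and name (the Snowball language) are assumed
    // to be fields of your custom Analyzer
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopSet != null)
        result = new StopFilter(result, stopSet);
    // Mark the excluded terms as keywords so the stemmer leaves them untouched
    result = new KeywordMarkerFilter(result, stemExclusionSet);
    result = new SnowballFilter(result, name);
    return result;
}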

You'll need to construct your own stemExclusionSet (see CharArraySet).
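
If it helps to see the idea in isolation, here is a minimal, library-free sketch of what a stem exclusion set does. It does not use any Lucene classes: `toyStem` is a hypothetical stand-in for a real Snowball stemmer (it only strips a trailing "s"), and `analyze` and `StemExclusionSketch` are names made up for illustration. The point is just that terms found in the exclusion set skip the stemming step, so "news" no longer collapses into "new".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StemExclusionSketch {

    // Hypothetical toy stemmer: strips one trailing "s".
    // It stands in for a real Snowball stemmer and is NOT Lucene code.
    static String toyStem(String term) {
        if (term.length() > 1 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Lower-cases each token, then stems it unless it is in the exclusion set.
    // This mirrors the order in the filter chain: lower-case, mark, stem.
    static List<String> analyze(List<String> tokens, Set<String> stemExclusionSet) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String term = token.toLowerCase(Locale.ROOT);
            out.add(stemExclusionSet.contains(term) ? term : toyStem(term));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> protectedWords = new HashSet<>(Arrays.asList("news"));
        // Prints [news, new, cat]: "news" is protected, "cats" is still stemmed.
        System.out.println(analyze(Arrays.asList("News", "new", "cats"), protectedWords));
    }
}
```

Note that the exclusion set is matched after lower-casing, so the entries themselves should be lower-case. The same set has to be used at both index time and query time, otherwise the protected term in the query will not match what was indexed.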