11
votes

I'm using a snowball analyzer to stem the titles of multiple documents. Everything works well, but there are some quirks.

Example:

A search for "valv", "valve", or "valves" returns the same number of results. This makes sense since the snowball analyzer reduces everything down to "valv".

I run into problems when using a wildcard. A search for "valve*" or "valves*" does not return any results. Searching for "valv*" works as expected.

I understand why this is happening, but I don't know how to fix it.

I thought about writing an analyzer that stores the stemmed and non-stemmed tokens. Basically applying two analyzers and combining the two token streams. But I'm not sure if this is a practical solution.

I also thought about using the AnalyzingQueryParser, but I don't know how to apply it to a multifield query. Also, using AnalyzingQueryParser would return results for "valve" when searching for "valves*", and that's not the expected behavior.

Is there a "preferred" way of utilizing both wildcards and stemming algorithms?

4 Answers

10
votes

I have used two different approaches to solve this before:

  1. Use two fields: one that contains stemmed terms, and another that contains terms generated by, say, the StandardAnalyzer. When you parse the search query, use the "standard" field if it is a wildcard search; otherwise use the field with stemmed terms. This may be harder to do if users type their queries directly into Lucene's QueryParser.

  2. Write a custom analyzer and index overlapping tokens. It basically consists of indexing the original term and the stem at the same position in the index using the PositionIncrementAttribute. You can look at SynonymFilter for an example of how to use the PositionIncrementAttribute correctly.
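
As a minimal sketch of the first approach, the routing decision can be made before the query is built. The field names `title_stemmed` and `title_exact` and the `FieldRouter` helper are hypothetical, not part of Lucene.Net; the sketch only shows how the field is chosen, not how the query is constructed.

```csharp
using System;

// Hypothetical helper: both field names are assumptions for illustration.
public static class FieldRouter
{
    public const string StemmedField = "title_stemmed";
    public const string ExactField = "title_exact";

    // Wildcard terms are not analyzed, so they must target the unstemmed field.
    public static string ChooseField(string term)
    {
        if (term == null) throw new ArgumentNullException("term");
        bool hasWildcard = term.IndexOf('*') >= 0 || term.IndexOf('?') >= 0;
        return hasWildcard ? ExactField : StemmedField;
    }
}
```

With this routing, "valves*" is searched against the unstemmed field, while a plain "valves" still goes to the stemmed field.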

I prefer solution #2.
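
To illustrate what the second approach puts in the index, here is a library-free sketch of the token layout. `OverlapToken`, `OverlapSketch`, and the `stem` delegate are hypothetical stand-ins; a real implementation would be a token filter that sets the PositionIncrementAttribute to 0 for the stacked token.

```csharp
using System;
using System.Collections.Generic;

// One emitted token: its text and its position increment relative to the previous token.
public struct OverlapToken
{
    public string Text;
    public int PositionIncrement;

    public OverlapToken(string text, int positionIncrement)
    {
        Text = text;
        PositionIncrement = positionIncrement;
    }
}

public static class OverlapSketch
{
    // 'stem' is a caller-supplied stand-in for the real stemmer (an assumption).
    public static List<OverlapToken> Expand(IEnumerable<string> terms, Func<string, string> stem)
    {
        var result = new List<OverlapToken>();
        foreach (string term in terms)
        {
            // The original term advances the position by one.
            result.Add(new OverlapToken(term, 1));

            // The stem is stacked on the same position (increment 0),
            // unless it is identical to the original term.
            string stemmed = stem(term);
            if (stemmed != term)
            {
                result.Add(new OverlapToken(stemmed, 0));
            }
        }
        return result;
    }
}
```

With this layout, "valves*" matches the unstemmed token, while a plain "valves" query, once stemmed to "valv", still matches the stacked stem.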

1
votes

I don't think there is an easy (and correct) way to do this.

My solution would be to write a custom query parser that finds the longest prefix of your search term that also matches a term in the index.

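using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Custom parser that shortens a prefix query until it matches a term in the index.
class MyQueryParser : Lucene.Net.QueryParsers.QueryParser
{
    IndexReader _reader;
    Analyzer _analyzer;

    public MyQueryParser(string field, Analyzer analyzer, IndexReader indexReader) : base(field, analyzer)
    {
        _analyzer = analyzer;
        _reader = indexReader;
    }

    // Note: depending on the Lucene.Net version, the base method may be declared
    // "protected internal"; the override must use the same access modifier.
    public override Query GetPrefixQuery(string field, string termStr)
    {
        // Drop one trailing character at a time, but never go below three characters.
        for (string longestStr = termStr; longestStr.Length > 2; longestStr = longestStr.Substring(0, longestStr.Length - 1))
        {
            // Terms() positions the enumerator at the first indexed term >= the given term.
            TermEnum te = _reader.Terms(new Term(field, longestStr));
            Term term = te.Term();
            te.Close();
            if (term != null && term.Field() == field && term.Text().StartsWith(longestStr))
            {
                // An indexed term starts with this prefix, so search with it.
                return base.GetPrefixQuery(field, longestStr);
            }
        }

        // No shortened prefix matched; fall back to the original term.
        return base.GetPrefixQuery(field, termStr);
    }
}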

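You can also try calling your analyzer inside GetPrefixQuery, since the analyzer is not applied to prefix queries by default:

// Run the prefix text through the analyzer so it is stemmed like the indexed terms.
// StringReader requires "using System.IO;".
TokenStream ts = _analyzer.TokenStream(field, new StringReader(termStr));
Lucene.Net.Analysis.Token token = ts.Next();
// Fall back to the raw text if the analyzer emits no token (e.g. a stop word).
string termstring = token != null ? token.TermText() : termStr;
ts.Close();
return base.GetPrefixQuery(field, termstring);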

But be aware that you can always find a case where the returned results are incorrect. This is why Lucene doesn't apply analyzers to wildcard queries.

1
votes

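This is the simplest solution, and it works:

Add solr.KeywordRepeatFilterFactory to your 'index' analyzer, before the stemming filter. It emits each token twice, once marked as a keyword so that the stemmer leaves that copy untouched.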

http://lucene.apache.org/core/4_8_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html

Also add RemoveDuplicatesTokenFilterFactory at the end of the 'index' analyzer, so that a token whose stem equals its original form is not indexed twice.

Your index will now always hold both the stemmed and the unstemmed form of each token at the same position, and you are good to go.
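
A hedged sketch of what such a field type might look like in a Solr schema. The field type name `text_stem_and_exact` is made up, and the tokenizer, lowercase filter, and English Snowball stemmer are assumptions; substitute whatever chain you already use. The point is the order: KeywordRepeatFilterFactory before the stemmer, RemoveDuplicatesTokenFilterFactory last.

```xml
<!-- Hypothetical field type; only the filter order matters here. -->
<fieldType name="text_stem_and_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits each token twice; the keyword-marked copy is skipped by the stemmer. -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- Drops the second copy when stemming left the token unchanged. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```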

0
votes

The only idea I have beyond the other answers is to run a dismax query against the two fields (stemmed and unstemmed), so that you can simply set their relative weights. The caveats are that some versions of dismax don't handle wildcards, and that some parsers are Solr-specific.