2
votes

I am using Lucene for search, and for tags I use the WhitespaceAnalyzer. The tags appear to be stored properly. With StandardAnalyzer, a search for 'C#' also returns results for C and C++. Every analyzer I have tried (I haven't tried all of them) does this except WhitespaceAnalyzer. That is fine, except that a search for c# (lowercase c instead of uppercase) returns no results. It is also annoying when I search for a title such as "Lucene insensitive whitespace analyzer?" and the stored title happens to be "Lucene Insensitive Whitespace analyzer?". (Note that the stored title capitalizes the first three words and not the last, whereas my search capitalizes only the first word.)

How do I make a case-insensitive whitespace analyzer? Note: WhitespaceAnalyzer is sealed.


3 Answers

3
votes

You can create a custom analyzer as shown below (written against Lucene 4.10.4 as an example):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

public class CaseInsensitiveWhitespaceAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split on whitespace only, so tokens such as "C#" and "C++" stay intact.
        Tokenizer tokenizer = new WhitespaceTokenizer(reader);
        // Lowercase every token so that matching ignores case.
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }
}

Use this same analyzer to configure your IndexWriter when indexing and to create your QueryParser when searching, so that indexed terms and query terms are lowercased the same way.
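To see why using the same analyzer on both sides matters, here is a rough plain-Java imitation of the token output you would expect from a whitespace tokenizer followed by a lowercase filter. It does not use Lucene at all; the class name `WhitespaceLowercaseSketch` and the method `analyze` are made up for illustration, and the real filter may differ in edge cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java imitation of WhitespaceTokenizer + LowerCaseFilter.
// Hypothetical helper for illustration only; this is not Lucene code.
class WhitespaceLowercaseSketch {

    // Split on runs of whitespace, then lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Index-time text: punctuation is kept, case is dropped.
        System.out.println(analyze("C# C++ Lucene")); // prints [c#, c++, lucene]
        // Query-time text: produces the same term as the indexed "C#".
        System.out.println(analyze("c#"));            // prints [c#]
    }
}
```

Because the indexed tag "C#" and the query "c#" both reduce to the term `c#`, they match, while "C" and "C++" remain distinct terms.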

2
votes
class CaseInsensitiveWhitespaceAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        var tokenizer = new WhitespaceTokenizer(reader);
        var lowercaseFilter = new LowerCaseFilter(tokenizer);

        return new StopFilter(true, lowercaseFilter, StopAnalyzer.ENGLISH_STOP_WORDS_SET, true);
    }
}

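Here's a C# version that works well for my use case. Note that it also removes English stop words via StopFilter; return lowercaseFilter directly if you only want case-insensitive whitespace tokenization.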