How to search special characters(+ ! \ ? : ) in Lucene

Question

I want to search special characters in index.

I escaped all the special characters in query string but when i perform query as + on lucene in index it create query as +().

Hence it search on no fields.

How to solve this problem? My index contains these special characters.

PLease give an example of what you are searching and what is created. What do you mean with "query as +"? — morja
I am searching for special characters like + ! ? etc. Well I got the solution. Actually we are using some custom analyzer and because of filters applied it was giving blanck query( +() ). But when i used KeywordAnalyzer it worked. Any how any input on it? — user660024
Are you using the same analyzer for indexing and for the query? Please add a code example describing your exact query and how you process it before calling the search. — Yuval F

Gene Golovchinsky Gene Golovchinsky · Accepted Answer · 2011-05-25T04:36:06

If you are using the StandardAnalyzer, that will discard non-alphanum characters. Try indexing the same value with a WhitespaceAnalyzer and see if that preserves the characters you need. It might also keep stuff you don't want: that's when you might consider writing your own Analyzer, which basically means creating a TokenStream stack that does exactly the kind of processing you need.

For example, the SimpleAnalyzer implements the following pipeline:

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
   return new LowerCaseTokenizer(reader);
}

which just lower-cases the tokens.

The StandardAnalyzer does much more:

/** Constructs a {@link StandardTokenizer} filtered by a {@link
StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
 }

You can mix & match from these and other components in org.apache.lucene.analysis, or you can write your own specialized TokenStream instances that are wrapped into a processing pipeline by your custom Analyzer.

One other thing to look at is what sort of CharTokenizer you're using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It's used by some simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own that keeps the characters you need and breaks on those you don't by implementing the boolean isTokenChar(char c) method.

How to search special characters(+ ! \ ? : ) in Lucene

2 Answers