3 votes

I'm using Lucene's StandardAnalyzer for a specific index field. Since special characters like àéèäöü do not get indexed as expected, I want to replace these characters:

  • à -> a
  • é -> e
  • è -> e
  • ä -> ae
  • ö -> oe
  • ü -> ue

What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?

I was looking for a way to hook in where the standard analyzer iterates over all tokens (words), so that I can retrieve each word and do the replacement there.

Thanks for any hints.

2 votes

It would be easier to help you out if you showed which methods you are calling on StandardAnalyzer. (FYI, you can't extend it, since the class is final.) That said, it looks like StandardAnalyzer has a constructor that takes a Reader. You could probably take advantage of this and pass it a custom reader? – Chetan Kinger

2 Answers

3 votes

I would propose using MappingCharFilter, which lets you define a map of Strings that will be replaced by other Strings, so it fits your requirements perfectly.

Some additional info: https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
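
For illustration, here is a minimal sketch of how that could be wired up, assuming Lucene 6.x (the version the Javadoc above links to); the class name FoldingAnalyzer is my own invention, not something from the question:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class FoldingAnalyzer extends Analyzer {
  // One entry per replacement listed in the question.
  private static final NormalizeCharMap CHAR_MAP;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("à", "a");
    builder.add("é", "e");
    builder.add("è", "e");
    builder.add("ä", "ae");
    builder.add("ö", "oe");
    builder.add("ü", "ue");
    CHAR_MAP = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // The char filter rewrites the raw input before it reaches the
    // tokenizer, so "ä" is already "ae" by the time tokens are cut.
    return new MappingCharFilter(CHAR_MAP, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer src = new StandardTokenizer();
    TokenStream tok = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, tok);
  }
}

Because the mapping happens at the character level, before tokenization, the two-letter replacements like ä -> ae come out exactly as listed in the question.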

0 votes

You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that in any case, so extending it wouldn't gain you much.

Instead, you can copy the StandardAnalyzer source and modify its createComponents method. For what you are asking, I would recommend adding ASCIIFoldingFilter, which attempts to convert non-ASCII characters (such as accented letters) into their ASCII equivalents. So you could create an analyzer something like this:

// Imports for Lucene 6.x; in 7.x and later, LowerCaseFilter and
// StopFilter moved to the org.apache.lucene.analysis package.
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    // Adding the folding filter before the StopFilter is probably most
    // helpful, so folded terms can still match the stop list.
    tok = new ASCIIFoldingFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(src, tok) {
      @Override
      protected void setReader(final Reader reader) {
        // Reset the max token length each time the components are reused.
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        super.setReader(reader);
      }
    };
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // Apply the same normalization to query terms at search time.
    TokenStream result = new StandardFilter(in);
    result = new LowerCaseFilter(result);
    result = new ASCIIFoldingFilter(result);
    return result;
  }
};
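
A quick way to see what this produces (a minimal sketch, assuming the analyzer above is in scope and run inside a method that can throw IOException; the field name "myfield" and the sample text are arbitrary):

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (TokenStream ts = analyzer.tokenStream("myfield", "àéèäöü")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term);  // prints "aeeaou"
  }
  ts.end();
}

Note that ASCIIFoldingFilter folds ä/ö/ü to the single letters a/o/u rather than to ae/oe/ue, so if you need exactly the two-letter mappings from the question, the MappingCharFilter approach in the other answer is the closer fit.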