3 votes

I'm using Lucene's StandardAnalyzer for a specific index field. Since special characters like àéèäöü do not get indexed as expected, I want to replace these characters:

  • à -> a
  • é -> e
  • è -> e
  • ä -> ae
  • ö -> oe
  • ü -> ue

What is the best approach to extend the org.apache.lucene.analysis.standard.StandardAnalyzer class?

I was looking for a way to hook in where the standard analyzer iterates over all tokens (words), so that I can retrieve each word and do the replacement there.

Thanks for any hints.

2 votes

It would be easier to help you out if you showed which methods you are calling on StandardAnalyzer. (FYI, you can't extend it, since the class is final.) That said, it looks like StandardAnalyzer has a constructor that takes a Reader. You could probably take advantage of this and pass it a custom reader? – Chetan Kinger

2 Answers

3 votes

I would propose using MappingCharFilter, which lets you define a map of Strings that will be replaced by other Strings, so it fits your requirements perfectly.

Some additional info: https://lucene.apache.org/core/6_0_0/analyzers-common/org/apache/lucene/analysis/charfilter/MappingCharFilter.html
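
For illustration, here is a minimal sketch of how that could be wired up, assuming Lucene 6.x (the version the Javadoc above links to); the class name FoldingAnalyzer is my own invention, not something from the question:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class FoldingAnalyzer extends Analyzer {
  // One entry per replacement listed in the question.
  private static final NormalizeCharMap CHAR_MAP;
  static {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("à", "a");
    builder.add("é", "e");
    builder.add("è", "e");
    builder.add("ä", "ae");
    builder.add("ö", "oe");
    builder.add("ü", "ue");
    CHAR_MAP = builder.build();
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // The char filter rewrites the raw input before it reaches the
    // tokenizer, so "ä" is already "ae" by the time tokens are cut.
    return new MappingCharFilter(CHAR_MAP, reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer src = new StandardTokenizer();
    TokenStream tok = new LowerCaseFilter(src);
    return new TokenStreamComponents(src, tok);
  }
}

Because the mapping happens at the character level, before tokenization, the two-letter replacements like ä -> ae come out exactly as listed in the question.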

0 votes

You wouldn't extend StandardAnalyzer, since analyzer implementations are final. The meat of an analyzer implementation is the createComponents method, and you would have to override that in any case, so extending it wouldn't gain you much.

Instead, you can copy the StandardAnalyzer source and modify its createComponents method. For what you are asking, I would recommend adding ASCIIFoldingFilter, which attempts to convert non-ASCII characters (such as accented letters) into their ASCII equivalents. So you could create an analyzer something like this:

// Imports for Lucene 6.x; in 7.x and later, LowerCaseFilter and
// StopFilter moved to the org.apache.lucene.analysis package.
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(final String fieldName) {
    final StandardTokenizer src = new StandardTokenizer();
    src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    TokenStream tok = new StandardFilter(src);
    tok = new LowerCaseFilter(tok);
    // Adding the folding filter before the StopFilter is probably most
    // helpful, so folded terms can still match the stop list.
    tok = new ASCIIFoldingFilter(tok);
    tok = new StopFilter(tok, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(src, tok) {
      @Override
      protected void setReader(final Reader reader) {
        // Reset the max token length each time the components are reused.
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        super.setReader(reader);
      }
    };
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // Apply the same normalization to query terms at search time.
    TokenStream result = new StandardFilter(in);
    result = new LowerCaseFilter(result);
    result = new ASCIIFoldingFilter(result);
    return result;
  }
};
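
A quick way to see what this produces (a minimal sketch, assuming the analyzer above is in scope and run inside a method that can throw IOException; the field name "myfield" and the sample text are arbitrary):

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (TokenStream ts = analyzer.tokenStream("myfield", "àéèäöü")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term);  // prints "aeeaou"
  }
  ts.end();
}

Note that ASCIIFoldingFilter folds ä/ö/ü to the single letters a/o/u rather than to ae/oe/ue, so if you need exactly the two-letter mappings from the question, the MappingCharFilter approach in the other answer is the closer fit.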