In my Lucene index I store names with special characters (e.g. Savić) in a field like the one described below.
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);
fieldType.setTokenized(false);
new Field("NAME", "Savić".toLowerCase(), fieldType);
I use a StopwordAnalyzerBase analyzer and Lucene Version.LUCENE_45.
If I search the field for exactly "savić", nothing is found. How should I deal with the special characters?
@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    // these characters are not treated as token separators
    PatternTokenizer src = new PatternTokenizer(reader,
            Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
    return new TokenStreamComponents(src, tok);
}
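A common way to make such searches match is to fold diacritics to their ASCII base characters at both index and query time; Lucene 4.5 ships ASCIIFoldingFilter for exactly this, which you could add to the chain above after the LowerCaseFilter. The folding idea itself can be sketched with plain JDK classes (the class and method names here are illustrative, not part of any Lucene API):

```java
import java.text.Normalizer;

public class FoldDemo {
    // Decompose to NFD so "ć" becomes "c" + combining acute accent,
    // then strip all combining marks and lowercase the result.
    // This is the same idea ASCIIFoldingFilter applies token by token.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(fold("Savić")); // savic
    }
}
```

If you fold like this when indexing, you must fold the query terms the same way (i.e. use the same analyzer on both sides), otherwise "savić" and "savic" still won't meet.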
StopwordAnalyzerBase is abstract. Implementations of it are quite varied, and include most of the commonly used analyzers. Are you using a custom implementation of it as your analyzer, or what? – femtoRgon
What does your createComponents do? – femtoRgon