1 vote

In my Lucene index I store names with special characters (e.g. Savić) in a field configured as shown below.

FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);
fieldType.setTokenized(false);
new Field("NAME", "Savić".toLowerCase(), fieldType);

I use a custom analyzer that extends StopwordAnalyzerBase, with Lucene Version.LUCENE_45.

If I search the field for exactly "savić", nothing is found. How do I deal with the special characters?

@Override
protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
    // these characters are not treated as token separators
    PatternTokenizer src = new PatternTokenizer(reader,
            Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);

    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);

    return new TokenStreamComponents(src, tok) {
        @Override
        protected void setReader(final Reader reader) throws IOException {
            super.setReader(reader);
        }
    };
}

StopwordAnalyzerBase is abstract. Implementations of it are quite varied, and include most of the commonly used analyzers. Are you using a custom implementation of it as your analyzer, or what? – femtoRgon
Sorry for being nonspecific. Yes, I use a custom implementation which basically just extends StopwordAnalyzerBase. – Smoose28
What does your createComponents do? – femtoRgon
@femtoRgon I added the createComponents method to the post. Thank you in advance! I'm an absolute newbie to Lucene and really lost. – Smoose28

1 Answer

0 votes

You have a couple of choices:

  1. Try adding an ASCIIFoldingFilter:

    src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new ASCIIFoldingFilter(tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
    

    This takes a fairly simplistic approach: it reduces non-ASCII characters, such as Ä, to their closest ASCII equivalents (A, in this case), whenever a reasonable ASCII alternative exists. It won't apply any language-specific intelligence to pick the best replacement, though. (A quick way to check what tokens your analyzer actually produces is sketched after this list.)

  2. For something more linguistically intelligent, many of the language-specific packages include tools for exactly this. One example is the GermanNormalizationFilter, which does similar things to the ASCIIFoldingFilter but applies rules appropriate to German, such as replacing 'ß' with 'ss'. You'd use it much like the code above (and see the note on query-time analysis after this list):

    src = new PatternTokenizer(reader, Pattern.compile("[\\W&&[^§/_&äÄöÖüÜßéèàáîâêûôëïñõãçœ◊]]"), -1);
    
    TokenStream tok = new StandardFilter(matchVersion, src);
    tok = new LowerCaseFilter(matchVersion, tok);
    tok = new GermanNormalizationFilter(tok);
    tok = new StopFilter(matchVersion, tok, TRIBUNA_WORDS_SET);
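
Whichever filter you pick, it helps to confirm what your analyzer actually emits. Here is a minimal sketch, assuming your StopwordAnalyzerBase subclass is called MyAnalyzer (the class name isn't given in the question):

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerCheck {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new MyAnalyzer(); // hypothetical name for your custom analyzer
            TokenStream ts = analyzer.tokenStream("NAME", new StringReader("Savić"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // with ASCIIFoldingFilter in the chain, this should print "savic"
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }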
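
Also note that adding a filter only changes what gets indexed from then on, so you'll need to rebuild the index, and the same analysis has to happen on the query side as on the index side. A minimal sketch, assuming you build queries with the classic QueryParser (the question doesn't show the search code):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class NameQueryBuilder {
        // Runs the user input through the same analyzer used for indexing,
        // so "Savić" is folded to "savic" and matches the indexed term.
        public static Query build(Analyzer analyzer, String input) throws ParseException {
            QueryParser parser = new QueryParser(Version.LUCENE_45, "NAME", analyzer);
            return parser.parse(QueryParser.escape(input));
        }
    }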