I have inherited a project which uses Lucene 4.6.0 to search XML documents.

Basically my problem seems to be this:

Searching a document whose text field contains "otherwise authorized as such" returns a highlighted document when searching for any of those words. But if the text field contains something like "[otherwise authorized as such]", then only a search for "authorized" returns a result.

I am guessing that Lucene is not seeing "[otherwise" and "such]" as words because of the square brackets? Not being a Lucene expert, even with the documentation, I am stuck on this. Is there a way to customize an Analyzer to include "[" as part of word searches?

Thanks

So, you want to be able to search for terms like "[otherwise"? - Mysterion
No, I would like to search for "otherwise" and have Lucene return results for both "[otherwise" and "otherwise". Currently, I am stripping the square brackets from the documents before writing them to the index, but this is not an ideal solution. - cditcher

1 Answer

You do not need to strip the unwanted characters manually. Instead, write a custom Analyzer that uses PatternReplaceCharFilter to remove the unneeded symbols before the text is tokenized.

An example of such an analyzer would look something like this:

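import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

class CustomAnalyzer extends Analyzer {

    // Strip brackets and parentheses from the raw text before it reaches the tokenizer.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        CharFilter cf = new PatternReplaceCharFilter(Pattern.compile("\\["), "", reader);
        cf = new PatternReplaceCharFilter(Pattern.compile("\\]"), "", cf);
        cf = new PatternReplaceCharFilter(Pattern.compile("\\)"), "", cf);
        cf = new PatternReplaceCharFilter(Pattern.compile("\\("), "", cf);
        return cf;
    }

    // Note: these signatures match the Lucene 5.x/6.x API. In 4.6.0, createComponents
    // also takes a Reader, and the tokenizer and filters take a Version argument.
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream tok = new StandardFilter(tokenizer);
        tok = new LowerCaseFilter(tok);
        return new TokenStreamComponents(tokenizer, tok);
    }
}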

Here I chose to remove only the '[', ']', '(' and ')' symbols.
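As a side note, the four chained filters could probably be collapsed into one by using a single character class. The sketch below uses only java.util.regex, not Lucene, to check that one pattern strips the same characters as the four separate ones. The class name BracketPatternDemo and its method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class BracketPatternDemo {

    // Single character class matching '[', ']', '(' and ')'.
    static final Pattern ALL_BRACKETS = Pattern.compile("[\\[\\]()]");

    // Mirrors the four separate patterns from the analyzer, applied in the same order.
    static String stripChained(String text) {
        String out = Pattern.compile("\\[").matcher(text).replaceAll("");
        out = Pattern.compile("\\]").matcher(out).replaceAll("");
        out = Pattern.compile("\\)").matcher(out).replaceAll("");
        out = Pattern.compile("\\(").matcher(out).replaceAll("");
        return out;
    }

    // Removes the same characters in a single pass over the text.
    static String stripSingle(String text) {
        return ALL_BRACKETS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "[otherwise authorized as such] (see above)";
        System.out.println(stripChained(sample));
        System.out.println(stripSingle(sample));
    }
}
```

If the two results agree on your data, passing that single pattern to one PatternReplaceCharFilter in initReader should be enough.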

After this index-time filtering, you will be able to search as normal.
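To see why searching works afterwards, here is a rough plain-Java imitation of the idea, assuming the same normalization is applied to both the indexed text and the query. It does not use Lucene at all: the analyze and matches helpers and the NormalizeDemo class are hypothetical stand-ins, and splitting on whitespace is only an approximation of what StandardTokenizer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class NormalizeDemo {

    // Rough stand-in for the analyzer: strip brackets, lowercase, split on whitespace.
    // This is NOT Lucene's StandardTokenizer; it only imitates the effect on simple text.
    static List<String> analyze(String text) {
        String cleaned = text.replaceAll("[\\[\\]()]", "").toLowerCase(Locale.ROOT).trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    // A document matches when its terms contain every analyzed query term.
    static boolean matches(String document, String query) {
        return analyze(document).containsAll(analyze(query));
    }

    public static void main(String[] args) {
        String doc = "[otherwise authorized as such]";
        for (String q : new String[] {"otherwise", "authorized", "such", "[otherwise"}) {
            System.out.println(q + " -> " + matches(doc, q));
        }
    }
}
```

Because the brackets are gone before the terms are produced, "otherwise" and "[otherwise" end up as the same term, which is the behavior asked for in the question.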

Full example of the code is located here