0
votes

I am using Lucene to index documents and search for values like $5000 and 90%, but in my search results I find that the StandardAnalyzer strips the $ and % while indexing, so I am left with a plain number without the $ and % symbols. I've tried the WhitespaceAnalyzer and the SimpleAnalyzer, but they don't handle numbers the way I need. Is there any way to make the StandardAnalyzer keep the $ and % in my indexed documents?

My current IndexWriter code looks like this:

private IndexWriter createWriter() throws IOException {
    FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    return new IndexWriter(dir, config);
}

1 Answer

0
votes

First of all, as far as indexing or searching is concerned, why do you need those special characters in your index? Your search will most likely work fine without those symbols.

Also, IMHO, if those are numeric values you shouldn't be using String or Text field types, and that is probably why you feel you need those symbols in the first place. If you are building something for numeric data, try the point fields such as LongPoint or DoublePoint.
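To illustrate the point-field idea: a minimal sketch, assuming Lucene core is on the classpath. The field names (`price`, `priceDisplay`) are made up for the example; the idea is to index the parsed number for numeric queries and keep the original display text ("$5000") in a stored-only field so it survives untouched for display.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoublePoint;
import org.apache.lucene.document.StoredField;

// Parse the numeric part out of the raw value before indexing.
String raw = "$5000";
double numeric = Double.parseDouble(raw.replaceAll("[$%]", "")); // 5000.0

Document doc = new Document();
// Indexed for fast numeric and range queries, e.g.
// DoublePoint.newRangeQuery("price", 1000.0, 10000.0)
doc.add(new DoublePoint("price", numeric));
// Stored verbatim (not analyzed), so "$5000" comes back in results as-is.
doc.add(new StoredField("priceDisplay", raw));
```

This way the symbols never need to survive analysis at all; they are simply carried along in a stored field.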

Having said that, what you are asking for is achievable with Solr but not with plain Lucene (as far as I know), unless you are willing to write your own analyzer.

Basically, Solr lets you configure your analyzers yourself (see Using StandardTokenizerFactory with currency), which you can't do by directly using StandardAnalyzer or SimpleAnalyzer - they do what they do, and that can't be customized.

You can use the builder of org.apache.lucene.analysis.custom.CustomAnalyzer (CustomAnalyzer Javadoc) to build your own analyzer. An analyzer basically consists of a tokenizer and a chain of token filters.
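As a sketch of what that builder looks like (assuming lucene-analyzers-common is on the classpath): a whitespace tokenizer splits only on whitespace, so tokens like "$5000" and "90%" pass through intact, and a lowercase filter normalizes the rest. The factory names ("whitespace", "lowercase") are the SPI names registered by lucene-analyzers-common.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

// Build a custom analyzer that preserves $ and % by tokenizing
// on whitespace only, instead of StandardAnalyzer's grammar rules.
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("whitespace")
        .addTokenFilter("lowercase")
        .build();

// Use it when creating the IndexWriter - and remember to use the
// same analyzer at query time, or the tokens won't match.
IndexWriterConfig config = new IndexWriterConfig(analyzer);
```

Note the trade-off: with whitespace tokenization you lose StandardAnalyzer's handling of punctuation, so "price:$5000," would keep the trailing comma as part of the token.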

I am not aware of a ready-made one that fits, but you can start by browsing this dependency -

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
</dependency>

to see if there is an analyzer or tokenizer that fits your need.

But again, I don't think you need those symbols in your index - the same result can be achieved with some pre- and post-processing around indexing and searching.
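The pre/post-processing idea can be sketched like this (the class and method names are illustrative, not from any Lucene API): strip the symbols before both indexing and querying, so the analyzed tokens always match, and keep the original text in a stored field for display.

```java
public class SymbolNormalizer {

    // Strip currency and percent symbols so "$5000" and "5000"
    // (or "90%" and "90") index and search identically.
    static String normalize(String raw) {
        return raw.replaceAll("[$%]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("$5000")); // 5000
        System.out.println(normalize("90%"));   // 90
    }
}
```

Apply `normalize` to field values before adding them to the Document and to user input before building the query; the raw value goes into a stored-only field if you need to show "$5000" in the results.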

How to index words with special character in Solr