Lucene - How to index a value with special characters

Question

I have a value I am trying to index that looks like this:

Test (Test)

Using a StandardAnalyzer, I attempted to add it to my document using:

Field.Store.YES, Field.Index.TOKENIZED

When I search for the value 'Test (Test)', my QueryParser generates the following query:

+Name:test +Name:test

This works as I expect, because I am not escaping the special characters.

However, if I call QueryParser.Escape('Test (Test)') when indexing my value, it creates the terms:

[test] and [test]

Then, when I run a search like this:

 QueryParser.Escape('Test (Test)')

I get the same two terms (as I expect). The problem is if I have two documents indexed with the names:

Test
Test (Test)

It matches on both. If I specify a search value of 'Test (Test)', I want to get only the second document. I am curious why escaping the special characters does not preserve them in the created terms. Is there an alternative Analyzer I should look at? I looked at WhitespaceAnalyzer and KeywordAnalyzer. WhitespaceAnalyzer is case-sensitive, and KeywordAnalyzer stores the value as a single term:

[Test (Test)]

This means that if I search for just 'Test', I will not get both documents back.

Any ideas on how to implement this? It doesn't seem like it should be that difficult.

Pascal Dimassimo · Accepted Answer · 2010-04-29T19:41:33

If you search for 'Test (Test)' and you want to retrieve only documents that contain that exact expression, you must enclose the search expression in double quotes ("...") so that Lucene knows you want to do a phrase search.

See the Lucene documentation for details:
http://lucene.apache.org/java/3_0_1/queryparsersyntax.html#Terms
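
To illustrate why a phrase search separates the two documents while a plain term search does not, here is a minimal sketch in plain Java. It does not call Lucene at all: the class PhraseMatchSketch and its methods are hypothetical names, and analyze() is only an assumed stand-in for what StandardAnalyzer does to this particular input (lowercase, drop the parentheses). The point is just that both documents contain the term [test], but only the second contains [test] [test] in consecutive positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only: no Lucene classes are used here.
class PhraseMatchSketch {

    // Assumed stand-in for StandardAnalyzer on this input: lowercase the
    // text and split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Term-style match (like +Name:test +Name:test): every query token
    // occurs somewhere in the document, position is ignored.
    static boolean matchesAllTerms(List<String> doc, List<String> query) {
        return doc.containsAll(query);
    }

    // Phrase-style match: the query tokens occur consecutively, in order.
    static boolean matchesPhrase(List<String> doc, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = analyze("Test (Test)");
        for (String name : Arrays.asList("Test", "Test (Test)")) {
            List<String> doc = analyze(name);
            System.out.println(name
                    + " -> terms: " + matchesAllTerms(doc, query)
                    + ", phrase: " + matchesPhrase(doc, query));
        }
    }
}
```

Running main reports that the term-style match accepts both names, while the phrase-style match accepts only 'Test (Test)'. In the actual query string, the phrase form would look something like Name:"Test (Test)". A search for just Test is still an ordinary term query, so it should keep returning both documents.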
