
Here is the thing. I have a term stored in the index that contains a special character, such as '-'. The simplest code looks like this:

Document doc = new Document();
doc.add(new TextField("message", "1111-2222-3333", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

And then I create a query using QueryParser, like this:

String queryStr = "1111-2222-3333";
QueryParser parser = new QueryParser(Version.LUCENE_36, "message", new StandardAnalyzer(Version.LUCENE_36));
Query q = parser.parse(queryStr);

And then I use a searcher to search the query and get no result. I have also tried this:

Query q = parser.parse(QueryParser.escape(queryStr));

And still no result.

Using a TermQuery directly instead of QueryParser does what I want, but that is not flexible enough for user-entered text.

I suspect the StandardAnalyzer drops the special character from the query string. Debugging showed that the string is split, and the actual query is: "message:1111 message:2222 message:3333". I don't know exactly what Lucene has done...

So what should I do if I want to query for a term that contains special characters? Should I write my own analyzer, or subclass the default QueryParser? And how?

Update:

1 @The New Idiot @femtoRgon, I've tried QueryParser.escape(queryStr), as stated in the question, but it still doesn't work.

2 I've tried another way to solve the problem. I derived a QueryTokenizer from Tokenizer that splits words only on spaces, wrapped it in a QueryAnalyzer, which derives from Analyzer, and finally passed the QueryAnalyzer to QueryParser.

Now it works. It originally failed because the default StandardAnalyzer splits queryStr according to its default rules (which treat some special characters as separators), so by the time QueryParser builds the query, the special characters have already been removed by StandardAnalyzer. My own tokenizer treats only spaces as separators, so the special characters stay in the query terms, and the search works.
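The difference between the two splitting strategies can be illustrated without Lucene at all. The following is only a plain-Java sketch: the class and method names are hypothetical, and `splitOnNonAlphanumeric` is a rough stand-in for what StandardAnalyzer does to this particular input (its real tokenization rules are more involved), not its actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
public class TokenizeSketch {

    // Splits only on whitespace, like the custom QueryTokenizer described above.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumption: approximates the observed StandardAnalyzer result by
    // treating every run of non-letter, non-digit characters as a separator.
    static List<String> splitOnNonAlphanumeric(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String query = "1111-2222-3333";
        System.out.println(splitOnNonAlphanumeric(query)); // [1111, 2222, 3333]
        System.out.println(splitOnWhitespace(query));      // [1111-2222-3333]
    }
}
```

Only the whitespace split yields the single token 1111-2222-3333, which is what a field indexed with Field.Index.NOT_ANALYZED actually contains.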

3 @The New Idiot @femtoRgon, thank you for answering my question.

Apologies, I obviously didn't read carefully enough. I am confused though: where is this TextField coming from? Lucene's TextField does not take a Field.Index argument (Field.Index is deprecated). To create a field like the one you have here, you would instead use a StringField. Is this some sort of custom TextField or something? - femtoRgon
Sorry, that's my fault. I am using Lucene 3.6, and there is no TextField in Lucene 3.x. The correct code should be: doc.add(new Field("message", "1111-2222-3333", Field.Store.YES, Field.Index.NOT_ANALYZED)); The Lucene 4.x and 3.x APIs are very different, and I'm still trying to understand the 4.x APIs. - Yuanchao Tang
Ah, makes more sense. A bit off topic, but if you are trying to get a handle on the changes in 4.x, have you looked at the migration guide? It calls out the major changes and provides some rationale. - femtoRgon
Oh, I haven't seen that. I'll look into it later; it will be very helpful. Thanks a lot :) - Yuanchao Tang
@Yuanchao-tang How does WhitespaceAnalyzer perform differently from yours? "I derived a QueryTokenizer from Tokenizer and cut the word only by space" - rrsk

2 Answers

23
votes

I am not sure about this, but I guess you need to escape - with \, as per the Lucene docs:

The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.

And again:

Lucene supports escaping special characters that are part of the query syntax. The current list special characters are

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

To escape these character use the \ before the character.

Also remember, some characters you'll need to escape twice if they have special meaning in Java.
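To make the double-escaping point concrete, here is a small plain-Java sketch. The class `EscapeSketch` and its `escape` method are hypothetical and only mimic the idea behind QueryParser.escape by prefixing each character from the list above with a backslash; they are not Lucene's implementation.

```java
// Hypothetical illustration only; this is not Lucene's QueryParser.escape.
public class EscapeSketch {

    // Characters from the list quoted above ("&&" and "||" are handled
    // one character at a time). In Java source, '"' and '\' themselves
    // need a backslash inside the string literal.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    // Prefixes every special character with a single backslash.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The query text that should reach the parser is 1111\-2222\-3333.
        System.out.println(escape("1111-2222-3333"));

        // Written by hand as a Java literal, each backslash is doubled.
        String handWritten = "1111\\-2222\\-3333";
        System.out.println(handWritten.equals(escape("1111-2222-3333"))); // true
    }
}
```

Escaping only stops the parser from reading - as an operator. As the asker reports, it does not prevent the analyzer from splitting the term afterwards.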

0
votes

You can add the value with addValue() instead of add or addText, and then search for the special character with a KeywordAnalyzer instead of StandardAnalyzer. Alternatively, add the data with addValue() and, when searching the data in Luke, replace the special character with the single-character wildcard (?). I have tried both ways and they work.