0
votes

I am posting this question again because my earlier query was not answered.

I am working on a book search API using Lucene. Users can search for a book whose title or description field contains C.F.A... I am using StandardAnalyzer along with a list of stop words.

I am using MultiFieldQueryParser to parse the above string, but after parsing, the dots in the string are removed. What am I missing here?

Thanks.

2 Answers

7
votes

As you mentioned, this is a dupe of this question. I suggest you at least add a link to it in your question. Also, I would urge you to create a user account, since right now it's not possible to look at your old question to get context.

The StandardAnalyzer specifically handles acronyms, and converts C.F.A. (for example) to cfa. This means you should be able to do the search, as long as you make sure you use the same analyzer for the indexing and for the query parsing.

I would suggest you run some more basic test cases to eliminate other factors. Try using an ordinary QueryParser instead of a multi-field one.

Here's some code I wrote to play with the StandardAnalyzer:

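import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Uses the older TokenStream.next() API, which returns one Token per call.
StringReader testReader = new StringReader("C.F.A. C.F.A word");
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("title", testReader);
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());
System.out.println(tokenStream.next());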

The output for this, by the way, was:

(cfa,0,6,type=<ACRONYM>)
(c.f.a,7,12,type=<HOST>)
(word,13,17,type=<ALPHANUM>)

Note, for example, that if the acronym doesn't end with a dot then the analyzer assumes it's an internet host name, so searching for "C.F.A" will not match "C.F.A." in the text.
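
To make that mismatch concrete, here is a small plain-Java sketch that imitates the two rules visible in the output above. It does not use Lucene at all, and AcronymSketch and normalize are hypothetical names invented for this illustration. It assumes an acronym is a run of single letters each followed by a dot, and that anything else is simply lowercased.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class AcronymSketch {
    // Assumption: an "acronym" is one or more single letters, each followed by a dot (e.g. "C.F.A.").
    private static final Pattern ACRONYM = Pattern.compile("(?:[A-Za-z]\\.)+");

    // Imitates the behaviour seen in the output above; this is NOT Lucene's actual implementation.
    public static String normalize(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (ACRONYM.matcher(token).matches()) {
            // Acronym with a trailing dot: drop all dots.
            return lower.replace(".", "");
        }
        // No trailing dot: the dots are kept, as with the HOST token.
        return lower;
    }

    public static void main(String[] args) {
        String indexed = normalize("C.F.A."); // term as it might be stored in the index
        String queried = normalize("C.F.A");  // term as the user typed it
        System.out.println(indexed);
        System.out.println(queried);
        System.out.println(indexed.equals(queried));
    }
}
```

Under these assumptions the indexed term comes out as cfa and the query term as c.f.a, so the two never compare equal. That is the same effect as a search for "C.F.A" missing "C.F.A." in the text.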

1
votes

(I'm only familiar with the Java version of Lucene, but I imagine that it doesn't matter in this case.)

The purpose of the analyzers is to strip away characters and formatting that prevent effective full-text search. For example, if you write a document where you only refer to Lucene as "lucene.net", you'd probably want a search for just "lucene" to return hits as well. Therefore the StandardAnalyzer strips the dots (as well as some other special characters).

Don't worry though. As always with Lucene, this can be configured, in this case by choosing a different analyzer. Try using SimpleAnalyzer or KeywordAnalyzer instead, and see which one is closest to your desired behaviour. If neither of them will do, you can even implement your own custom analyzer by extending the Analyzer class. It's actually quite simple.
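
As a rough idea of how much that choice matters, the plain-Java sketch below imitates the documented tokenization rules of the two analyzers mentioned above: SimpleAnalyzer splits at non-letter characters and lowercases, while KeywordAnalyzer keeps the whole input as a single token. It does not call Lucene; TokenizingSketch, simpleStyle and keywordStyle are made-up names, so treat the results as an approximation and check them against your own Lucene version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TokenizingSketch {
    // Approximates SimpleAnalyzer: split at every non-letter, then lowercase each piece.
    public static List<String> simpleStyle(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Approximates KeywordAnalyzer: the entire input becomes one unmodified token.
    public static List<String> keywordStyle(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String input = "C.F.A. word";
        System.out.println(simpleStyle(input));
        System.out.println(keywordStyle(input));
    }
}
```

With these rules, the simple style breaks "C.F.A." into the separate tokens c, f and a, while the keyword style keeps the dots but only matches when the whole field value is identical. Neither may be exactly what a title or description search needs, which is where a custom analyzer comes in.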

Good luck. :)