2
votes

So I have a task to make some kind of type-ahead using Lucene 6. Basic requirements:

  • Queries should match partial words, not only whole tokens. If the string "sum of sales" is indexed, then the query "sum of sa" should match.

  • Relevancy should come out of the box or be easy to implement. Matches at the start of the indexed string should score higher than matches in the middle. A full match should have the highest score, and so on.
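
To make the expected ordering concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class and method names are made up for illustration, the numeric ranks are arbitrary, and a "middle" match is assumed to start at a word boundary.

```java
// Illustration of the desired relevancy order only; names and rank values are hypothetical.
final class TypeAheadRankSketch {

    // Higher is better: 3 = full match, 2 = match at the start,
    // 1 = match starting at a later word, 0 = no match.
    static int rank(String indexed, String query) {
        if (indexed.equals(query)) {
            return 3;
        }
        if (indexed.startsWith(query)) {
            return 2;
        }
        if (indexed.contains(" " + query)) {
            return 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(rank("sum of sales", "sum of sales"));    // full match
        System.out.println(rank("sum of sales", "sum of sa"));       // match at the start
        System.out.println(rank("total sum of sales", "sum of sa")); // match in the middle
    }
}
```

Whatever Lucene query I end up with should order documents the same way this function does.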

So far I've tried:

  • PhraseQuery, which has acceptable relevancy out of the box but does not match partial words.

  • Combining a PhraseQuery for all complete words in the query with a WildcardQuery for the last (possibly incomplete) word, using BooleanQuery. This matches those two parts in any order, so it isn't good for me.

  • Indexing a separate, untokenized copy of the field and using PrefixQuery and WildcardQuery. These don't give the scoring I would like out of the box.

Is there any approach I missed that could save my day and possibly next week?

2
I think in most cases approach number 2 should be OK. I'm also curious whether you could share examples where this behavior (any order of tokens) could return false positive results. - Mysterion
Also, what exactly is wrong with approach 3? - Mysterion
But I explained what's wrong... - Aleksandr Kravets
I'll try again. The second approach will match the last part (the wildcard) even if it occurs in the text BEFORE the other terms (the parts of the phrase). I need all terms to be matched in order. - Aleksandr Kravets
The third case does not give me any useful scoring. Consider two strings, "test string" and "test string zero". If I search with a PrefixQuery for "test string", I get both documents, each with score=1. That is, the order in which they are returned will most likely be the order in which they were added to the index. But I expect a full match to be more relevant than a partial one (to have a higher score). - Aleksandr Kravets

2 Answers

0
votes

Simplest solution: if you're using approach number 3, you can get proper scoring by setting a RewriteMethod on the MultiTermQuery (both PrefixQuery and WildcardQuery are subclasses of MultiTermQuery):

MultiTermQuery query = new WildcardQuery(new Term("field1", "sum of sa*"));
// Lucene 5+ names this constant SCORING_BOOLEAN_REWRITE (4.x called it SCORING_BOOLEAN_QUERY_REWRITE).
query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);

A possible problem with this method is that it can hit a BooleanQuery.TooManyClauses exception. In that case you will need to implement your own RewriteMethod that somehow preserves the performance of the constant-score rewrite while still providing scoring.

0
votes

I don't think any of the proposed solutions meet the performance requirements of a type-ahead. Moreover, there is an established best practice: EdgeNGramTokenFilter. Applied after a word tokenizer, this filter turns the input:

sum of sales

into

s, su, sum, o, of, s, sa, sal, sale, sales

An example of the filter to use would be:

new EdgeNGramTokenFilter(result, 1, 20);

Where result is your TokenStream input, 1 is the minimum n-gram length and 20 is the maximum. Grams are always taken from the start of each token; older Lucene releases took an extra Side.FRONT argument for this, which the Lucene 6 constructor no longer has. There are plenty of more detailed examples around, and this is the solution you want to use for your type-ahead.
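
To show what edge n-gram expansion produces, here is a minimal plain-Java sketch. It does not use Lucene: the class and method names are hypothetical, and whitespace splitting stands in for whatever tokenizer runs before the filter in a real analysis chain.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; in Lucene, EdgeNGramTokenFilter does this inside an analyzer.
final class EdgeNGramSketch {

    // Returns the leading substrings of one token with lengths minGram..maxGram.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Splits on whitespace (an assumed upstream tokenizer) and expands every word.
    static List<String> analyze(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            out.addAll(edgeNGrams(word, minGram, maxGram));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [s, su, sum, o, of, s, sa, sal, sale, sales]
        System.out.println(analyze("sum of sales", 1, 20));
    }
}
```

Because every prefix is indexed as its own term, a query such as "sum of sa" that is analyzed without the n-gram filter yields the terms sum, of and sa, each of which exists in the index, so no wildcard expansion is needed at query time.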