I have a product database (for soccer) with about 5k products. The Lucene index for the product search currently contains the fields Name, Category, Color and Numbers (ArtNo and EANs).
Relevant table example for the problem:
| Name | Color |
| --- | --- |
| Nike Training football | red black |
| Nike Match football | black white |
For the index I have created a custom Analyzer, so I can extend a StandardAnalyzer with additional behavior. The creation of the stream looks like this at the moment:
TokenStream result = new StandardTokenizer(Util.Version.LUCENE_29, reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(true, result, stoptable );
return result;
The Analyzer is used both for the Indexer and the Searcher.
This is the current search logic:
// Top-level query: every search term must match in at least one field.
BooleanQuery booleanQuery = new BooleanQuery(true);
string[] terms = query.Split(' ');
foreach (string s in terms)
{
    // Per-term query: the term may match in any of the four fields.
    BooleanQuery subQuery = new BooleanQuery(true);

    // Name: fuzzy match with minimum similarity 0.9, boost 6.
    var nameQuery = new FuzzyQuery(new Term("Name", s), 0.9f);
    nameQuery.SetBoost(6);
    subQuery.Add(nameQuery, BooleanClause.Occur.SHOULD);

    // Color: exact term match, default boost.
    var colorQuery = new TermQuery(new Term("Color", s));
    subQuery.Add(colorQuery, BooleanClause.Occur.SHOULD);

    // Category: fuzzy match with minimum similarity 0.9, boost 2.
    var categoryQuery = new FuzzyQuery(new Term("Category", s), 0.9f);
    categoryQuery.SetBoost(2);
    subQuery.Add(categoryQuery, BooleanClause.Occur.SHOULD);

    // Numbers (ArtNo and EANs): exact term match, boost 10.
    var numbersQuery = new TermQuery(new Term("Numbers", s));
    numbersQuery.SetBoost(10);
    subQuery.Add(numbersQuery, BooleanClause.Occur.SHOULD);

    booleanQuery.Add(subQuery, BooleanClause.Occur.MUST);
}
This already works to some extent.
The problem:
A lot of products have names or categories containing words a user just won't search for. In the example I have used "Nike Match football". (Note: I have only translated it for use on SO, as most of the terms in the database are German.)
If I search for "Nike football red" I do get the result. But if I search for "Nike ball red" I don't get it, although this is how users will search for it. As far as I know, Lucene can't search for substrings (except with wildcards), because it only compares whole tokens, and substring matching is exactly what I need.
I have made Name and Category fuzzy and given every column a boost according to its relevance.
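To make the fuzzy threshold concrete, here is a minimal sketch of how I understand the similarity check to work. It assumes the classic FuzzyQuery formula is 1 - editDistance / length of the shorter term (with no prefix length); that formula is my assumption, and `FuzzySketch`, `EditDistance` and `Similarity` are names made up for this illustration, not Lucene API.

```csharp
using System;

static class FuzzySketch
{
    // Plain Levenshtein edit distance (insertions, deletions, substitutions).
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // ASSUMPTION: similarity = 1 - distance / length of the shorter term.
    public static float Similarity(string queryTerm, string indexedTerm)
    {
        int shorter = Math.Min(queryTerm.Length, indexedTerm.Length);
        if (shorter == 0) return queryTerm.Length == indexedTerm.Length ? 1f : 0f;
        return 1f - (float)EditDistance(queryTerm, indexedTerm) / shorter;
    }

    static void Main()
    {
        // "ball" needs four insertions to become "football".
        Console.WriteLine(Similarity("ball", "football"));
        // A single missing letter in an eight-letter word.
        Console.WriteLine(Similarity("footbal", "football"));
    }
}
```

If that formula is right, "ball" against "football" has a similarity of 0, and even "footbal" against "football" stays below 0.9, so with a minimum similarity of 0.9 the fuzzy queries would behave almost like exact matches for short terms.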
I have already read about n-grams, but I really don't know how to use them correctly. The indexer works when I add the NGramTokenFilter to my custom analyzer. There are two problems: I don't want n-grams for every column (just Name and Category), and the results are strange once the filter is enabled.
If I add result = new NGramTokenFilter(result, 3, 4); to my analyzer and search for "nike ball", it returns nothing at all.
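As a rough sketch of what I expect an n-gram filter with a minimum size of 3 and a maximum size of 4 to produce, the following plain C# (no Lucene involved) lists the character n-grams of "football" and the ones it shares with "ball". `NGramSketch` and `NGrams` are hypothetical helper names for this illustration, and whether NGramTokenFilter emits exactly these terms is my assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class NGramSketch
{
    // All character n-grams of the token with minGram <= length <= maxGram.
    public static List<string> NGrams(string token, int minGram, int maxGram)
    {
        var grams = new List<string>();
        for (int n = minGram; n <= maxGram; n++)
        {
            for (int start = 0; start + n <= token.Length; start++)
            {
                grams.Add(token.Substring(start, n));
            }
        }
        return grams;
    }

    static void Main()
    {
        var indexed = NGrams("football", 3, 4);
        var queried = NGrams("ball", 3, 4);

        // prints: foo oot otb tba bal all foot ootb otba tbal ball
        Console.WriteLine(string.Join(" ", indexed));

        // Grams the query term shares with the indexed term.
        // prints: bal all ball
        Console.WriteLine(string.Join(" ", queried.Intersect(indexed)));
    }
}
```

Under this model "ball" would show up among the grams of "football", which is why I expected the search to find the product. It also means that, if the filter emits only the grams, the index would no longer contain "football" as a whole term, so the query terms would presumably have to be broken into grams in the same way.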
Are n-grams the solution here? What am I doing wrong?
And do you have any other suggestions on how to improve a product search?