Substring matching for product search

Question

I have a product database (for soccer) which contains about 5k products. The Lucene index for product search currently contains Name, Category, Color and Numbers (ArtNo and EANs).

Relevant table example for the problem:

| Name                   | Color       |
|------------------------|-------------|
| Nike Training football | red black   |
| Nike Match football    | black white |

For the index I have created a custom Analyzer, so I can extend a StandardAnalyzer with additional behavior. The creation of the stream looks like this at the moment:

TokenStream result = new StandardTokenizer(Util.Version.LUCENE_29, reader );
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(true, result, stoptable );
return result;

The Analyzer is used both for the Indexer and the Searcher.

This is the current search logic:

BooleanQuery booleanQuery = new BooleanQuery(true);
string[] terms = query.Split(' ');

foreach (string s in terms)
{
  BooleanQuery subQuery = new BooleanQuery(true);

  var nameQuery = new FuzzyQuery(new Term("Name", s), 0.9f);
  nameQuery.SetBoost(6);
  subQuery.Add(nameQuery, BooleanClause.Occur.SHOULD);

  var colorQuery = new TermQuery(new Term("Color", s));
  subQuery.Add(colorQuery, BooleanClause.Occur.SHOULD);

  var categoryQuery = new FuzzyQuery(new Term("Category", s), 0.9f);
  categoryQuery.SetBoost(2);
  subQuery.Add(categoryQuery, BooleanClause.Occur.SHOULD);

  var numbersQuery = new TermQuery(new Term("Numbers", s));
  numbersQuery.SetBoost(10);
  subQuery.Add(numbersQuery, BooleanClause.Occur.SHOULD);

  booleanQuery.Add(subQuery, BooleanClause.Occur.MUST);
}

This already works to some extent.

The problem:

A lot of products have names or categories with words a user just won't search. In the example I have used "Nike Match football". (Note: I have only translated it for use on SO, as most of the terms are German in the database)

If I search for "Nike football red" I do get the result. But if I search for "Nike ball red" I don't get it, although this is how users will search for it. Afaik Lucene can't search for substrings (except with wildcards), as it only compares whole tokens, and substring matching is what I need.

I have made Name and Category fuzzy and gave every column an appropriate boost according to its relevance.

I have already read about n-grams, but I really don't know how to use them correctly. The indexer works when I add the NGramTokenFilter to my custom analyzer. The problems are that I don't want it for every column (just Name and Category), and that the results are completely weird when I activate it.

If I add result = new NGramTokenFilter(result, 3, 4); to my analyzer and search for "nike ball", it just returns nothing.

Are n-grams the solution here? What am I doing wrong?

And do you have any other suggestions on how to improve a product search?

You should note that your analyzer will not be applied to your query unless you are using a QueryParser. Manually constructed queries are not analyzed, as indicated by the fact that you have to split on spaces yourself instead of letting the analyzer handle that. — femtoRgon

dom dom · Accepted Answer · 2017-10-16T12:49:17

I'm not familiar with n-grams, but as I see it there are two approaches in your case:

1. Work with wildcards in searches

Use prefix or fuzzy queries on the fields you want to search. It is important that you use TextField (Javadoc), because these fields are analyzed (StringField is not) and are meant for full-text search. With that in place it should be possible to search with multiple terms that do not match exactly.

2. Work with different analyzers for different fields

You can analyze different fields with different analyzers using the PerFieldAnalyzerWrapper (Javadoc). Define which field should be analyzed with which analyzer and you're good to go. But make sure that you use the same analyzer for indexing and searching (that is Lucene best practice).
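To make the per-field idea concrete, here is a minimal, library-free sketch in plain Java. It is not the Lucene API: the class name, the method names and the field-to-analysis mapping are all made up for illustration. It uses the n-grams from the question as the field-specific step, applied only to Name and Category, while Color keeps whole words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustration only: none of these names are Lucene classes or methods.
final class PerFieldAnalysisSketch {

    // Plain analysis: lowercase and split on whitespace.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Plain analysis followed by character n-grams of length min..max per word.
    static List<String> wordNGrams(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (String w : words(text)) {
            for (int n = min; n <= max; n++) {
                for (int i = 0; i + n <= w.length(); i++) {
                    out.add(w.substring(i, i + n));
                }
            }
        }
        return out;
    }

    // Hypothetical mapping: n-grams only where substring matching is wanted.
    static final Map<String, Function<String, List<String>>> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("Name", t -> wordNGrams(t, 3, 4));
        ANALYZERS.put("Category", t -> wordNGrams(t, 3, 4));
        ANALYZERS.put("Color", PerFieldAnalysisSketch::words);
    }

    // Fields without an entry fall back to the plain analysis.
    static Set<String> analyze(String field, String text) {
        Function<String, List<String>> analysis =
                ANALYZERS.getOrDefault(field, PerFieldAnalysisSketch::words);
        return new LinkedHashSet<>(analysis.apply(text));
    }

    public static void main(String[] args) {
        Set<String> name = analyze("Name", "Nike Match football");
        Set<String> color = analyze("Color", "black white");
        System.out.println(name.contains("ball"));  // true: "ball" is a 4-gram of "football"
        System.out.println(color.contains("lack")); // false: Color holds whole words only
    }
}
```

In this sketch a query word longer than the maximum gram size (for example "football", with 8 characters) equals none of the stored grams, so the query text has to go through the same per-field analysis as the indexed text. That is the same point as using one analyzer for indexing and searching.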

Additional information

If you use wildcards and umlauts (German, yaaaay), you have to know that wildcard queries are not analyzed like normal queries. I faced the same problem and solved it with two kinds of field:

"normalized" (saved without Umlaut -> convert to basic latin characters with analyzer)
"non-normalized" (saved with Umlaut)

Then, when searching, use a BooleanQuery over these two fields.
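The two-field idea can be sketched without Lucene as follows. This is only an illustration in plain Java: the class and method names are hypothetical, and it assumes that "normalizing" means dropping the diacritic (ü becomes u) rather than transliterating it to "ue". The user's prefix is deliberately left unnormalized, mirroring the fact that wildcard terms are not analyzed.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: plain Java stand-in for a "normalized" and a "non-normalized" field.
final class UmlautFieldsSketch {

    // Lowercase, decompose accented letters (NFD), then drop the combining marks.
    static String normalize(String s) {
        String decomposed =
                Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Prefix check against both stored forms, like two optional clauses in one query.
    static boolean prefixMatches(String storedName, String userPrefix) {
        String prefix = userPrefix.toLowerCase(Locale.ROOT);
        String nonNormalized = storedName.toLowerCase(Locale.ROOT);
        String normalized = normalize(storedName);
        return nonNormalized.startsWith(prefix) || normalized.startsWith(prefix);
    }

    public static void main(String[] args) {
        String stored = "Sch\u00fctzer"; // "Schützer", written with an escape for portability
        System.out.println(normalize(stored));                     // schutzer
        System.out.println(prefixMatches(stored, "schu"));         // true, via the normalized form
        System.out.println(prefixMatches(stored, "sch\u00fc"));    // true, via the non-normalized form
        System.out.println(prefixMatches(stored, "ball"));         // false
    }
}
```

Keeping both forms means a user gets the hit whether or not they type the umlaut, without having to analyze the wildcard term itself.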
