4
votes

This is related to, and a follow-up on, this question: Azure Search Analyzer

I want to use a keyword analyzer for collections of words.

We have documents (products) with different fields like product_name, brand, categorie and so on.
To implement keyword-based ranking (scoring), I would like to add a Collection(Edm.String) field that contains different (untokenized!) keywords, such as "brown teddy" or "green bean".
To achieve this, I thought about using a keyword analyzer with the following definition:

// field definition:
{
"name": "keyWordList",
"type": "Collection(Edm.String)",
"analyzer": "keywordAnalyzer"
}
...

"analyzers": [ {
"name":"keywordAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"keywordTokenizer",
"tokenFilters":[ "lowercase", "classic" ]
} ]
...

"tokenizers": [{
"name": "keywordTokenizer",
"@odata.type": "#Microsoft.Azure.Search.KeywordTokenizer"
} ]

Now, after uploading some documents, I can't find them by entering exactly the chosen keywords. For example, there is a document with the following field data:

"keyWordList": [ "Blue Bear", "blue bear", "blue bear123" ]

I'm not able to get any result with the following search query:

{ search:"blue bear", count:"true", queryType:"full" }

Here is what I tried as well:

  • using the predefined keyword analyzer instead of a custom one -> no success
  • instead of using Collection(Edm.String), testing with a plain String field containing only one keyword -> no success
  • splitting the analyzer in the field definition block into searchAnalyzer="lowercaseAnalyzer" and filterAnalyzer="keywordAnalyzer", and vice versa -> no success

In the end, the only way I could get a result was by sending the whole search phrase as a single term. But this should be done by the analyzer, right?!

{ search:"\"blue bear\"", count:"true", queryType:"full" }

Users don't know whether they are searching for an existing keyword or performing a tokenized search. That's why this isn't an option.

Is there any solution to this issue? Or is there maybe a better or easier approach for this kind of keyword (high-scoring) search?

Thanks!

1 Answer

9
votes

Short answer:

The behavior you're observing is correct.

Semantically, your search query blue bear means: find all documents that match the term blue or the term bear. Since you are using the keyword tokenizer, the terms that you indexed are blue bear and blue bear123. The terms blue and bear individually don't exist in your index. That's why only the phrase query returns the result you are expecting.


Long answer:

Let me explain how the analyzer is applied during query processing and how it's applied during document indexing.

On the indexing side, the analyzer you defined processes elements of the keyWordList collection independently. The terms that end up in your inverted index are:

  • blue bear (since you're using the lowercase filter, blue bear and Blue Bear are analyzed to the same term).
  • blue bear123

    As you'd expect, blue bear is one term - not split into two on the space - since you're using the keyword tokenizer. The same applies to blue bear123.
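
The indexing side can be imitated with a few lines of Python. This is only a toy sketch, not the actual Azure Search implementation: keyword_analyze and build_index are hypothetical helpers that stand in for the keyword tokenizer plus the lowercase filter, and the classic filter from the question is left out.

```python
def keyword_analyze(text: str) -> list[str]:
    """Toy stand-in for keyword tokenizer + lowercase filter.

    The whole input becomes exactly one token; it is never split on spaces.
    """
    return [text.lower()]


def build_index(values: list[str]) -> set[str]:
    """Analyze each collection element independently and collect the terms."""
    terms: set[str] = set()
    for value in values:
        terms.update(keyword_analyze(value))
    return terms


# The keyWordList value from the question.
index_terms = build_index(["Blue Bear", "blue bear", "blue bear123"])
print(sorted(index_terms))  # ['blue bear', 'blue bear123']
```

Three collection elements produce only two terms because "Blue Bear" and "blue bear" collapse into one after lowercasing.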

On the query processing side, two things happen:

  1. Your search query is rewritten to: blue|bear (find documents with blue or bear). This is because searchMode=any is used by default. If you used searchMode=all, your search query would be rewritten to blue+bear (find documents with blue and bear).

    The query parser takes your search query string and separates query operators (such as +, |, * etc.) from query terms. Then it decomposes the search query into subqueries of supported types, e.g., terms followed by the suffix operator ‘*’ become a prefix query, quoted terms become a phrase query, etc. Terms that are not preceded or followed by any of the supported operators become individual term queries.

    In your example, the query parser decomposed your query string blue|bear into two term queries with terms blue and bear respectively. The search engine looks for documents that match any of those queries (searchMode=any).

  2. Query terms of the identified subqueries are processed by the search analyzer.

    In your example, terms blue and bear are processed by the analyzer individually. They are not modified since they are already lowercase. None of those tokens exist in your index, thus no results are returned.

    If your query looked as follows: "Blue Bear" (with quotes), it would be rewritten to "Blue Bear" - notice no change; the OR operator has not been put between the words since now you're looking for a phrase. The query parser passes the entire phrase term (two words) to the analyzer, which in turn outputs a single, lowercased term: blue bear. This token matches what's in your index.

The key lesson here is that the query parser processes the query string before the analyzers are applied. The analyzers are applied to individual terms of subqueries identified by the query parser.
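
To make the order of operations concrete, here is a rough Python sketch of the query side. It is a heavy simplification under stated assumptions: parse_query, keyword_analyze and search are hypothetical helpers, the parser only knows about quoted phrases and plain words, and matching follows searchMode=any.

```python
import re

# Terms that ended up in the index for the example document.
INDEX_TERMS = {"blue bear", "blue bear123"}


def keyword_analyze(text: str) -> list[str]:
    """Toy stand-in for keyword tokenizer + lowercase filter."""
    return [text.lower()]


def parse_query(query: str) -> list[str]:
    """Step 1: the parser splits the query string into subqueries.

    Quoted spans become one phrase subquery; every other
    whitespace-separated word becomes its own term subquery.
    """
    matches = re.findall(r'"([^"]*)"|(\S+)', query)
    return [phrase or word for phrase, word in matches]


def search(query: str) -> bool:
    """Step 2: the analyzer runs on each subquery, not on the whole string."""
    for subquery in parse_query(query):
        tokens = keyword_analyze(subquery)
        if any(token in INDEX_TERMS for token in tokens):
            return True  # searchMode=any: one matching subquery is enough
    return False


print(parse_query('blue bear'))  # ['blue', 'bear']
print(search('blue bear'))       # False
print(search('"Blue Bear"'))     # True
```

The unquoted query never reaches the analyzer as blue bear; it arrives as two separate inputs, blue and bear, and neither is an indexed term.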

I hope this helps you understand the behavior you're observing. Note, you can test the output of your custom analyzer using the Analyze API.
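
As a rough illustration, a request to the Analyze API could be built like this in Python. The service name, index name, API key and api-version value below are placeholders chosen for the example, not values from the question, and build_analyze_request is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_analyze_request(service: str, index: str, api_key: str,
                          text: str, analyzer: str,
                          api_version: str = "2020-06-30") -> urllib.request.Request:
    """Build (but do not send) a POST request to the index's analyze endpoint."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder service, index and key; substitute your own.
req = build_analyze_request("my-service", "my-index", "<admin-api-key>",
                            text="Blue Bear", analyzer="keywordAnalyzer")
print(req.full_url)
print(req.data.decode("utf-8"))

# To actually call the service:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The response lists the tokens the analyzer produced for the given text, so for "Blue Bear" with the analyzer from the question you would expect a single token rather than two.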