I have a fairly basic Azure Search index with several fields of searchable string data, for example [abridged]...

"fields": [
  {
    "name": "Field1",
    "type": "Edm.String",
    "facetable": false,
    "filterable": true,
    "key": true,
    "retrievable": true,
    "searchable": true,
    "sortable": false,
    "analyzer": null,
    "indexAnalyzer": null,
    "searchAnalyzer": null,
    "synonymMaps": [],
    "fields": []
  },
  {
    "name": "Field2",
    "type": "Edm.String",
    "facetable": false,
    "filterable": true,
    "retrievable": true,
    "searchable": true,
    "sortable": false,
    "analyzer": "en.microsoft",
    "indexAnalyzer": null,
    "searchAnalyzer": null,
    "synonymMaps": [],
    "fields": []
  }
]

Field1 is loaded with alphanumeric id data and Field2 is loaded with English language string data, specifically the name/title of the record. searchMode=all is also being used to ensure the accuracy of the results.

Let's say one of the records indexed has the following Field2 data: BA (Hons) in Business, Organisational Behaviour and Coaching. Putting that into the en.microsoft analyzer, this is the result we get out:

"tokens": [
    {
        "token": "ba",
        "startOffset": 0,
        "endOffset": 2,
        "position": 0
    },
    {
        "token": "hon",
        "startOffset": 4,
        "endOffset": 8,
        "position": 1
    },
    {
        "token": "hons",
        "startOffset": 4,
        "endOffset": 8,
        "position": 1
    },
    {
        "token": "business",
        "startOffset": 13,
        "endOffset": 21,
        "position": 3
    },
    {
        "token": "organizational",
        "startOffset": 23,
        "endOffset": 37,
        "position": 4
    },
    {
        "token": "organisational",
        "startOffset": 23,
        "endOffset": 37,
        "position": 4
    },
    {
        "token": "behavior",
        "startOffset": 38,
        "endOffset": 47,
        "position": 5
    },
    {
        "token": "behaviour",
        "startOffset": 38,
        "endOffset": 47,
        "position": 5
    },
    {
        "token": "coach",
        "startOffset": 52,
        "endOffset": 60,
        "position": 7
    },
    {
        "token": "coaching",
        "startOffset": 52,
        "endOffset": 60,
        "position": 7
    }
]
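
For anyone wanting to reproduce a token list like this one, here is a minimal sketch of building a request for the Analyze Text REST endpoint. The service name, index name, key and api-version below are placeholders I have assumed for illustration, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text,
                          analyzer="en.microsoft", api_version="2020-06-30"):
    """Build (but do not send) a POST request for the Analyze Text endpoint.

    service, index, api_key and api_version are placeholder assumptions.
    """
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_analyze_request(
    "my-service",   # hypothetical service name
    "my-index",     # hypothetical index name
    "<api-key>",    # placeholder, never hard-code a real key
    "BA (Hons) in Business, Organisational Behaviour and Coaching",
)
# urllib.request.urlopen(req) would send it and return the JSON token list.
```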

As you can see, the tokens returned are what you'd expect for such a string. However, when that same indexed string value is used as a search term (sadly a valid use case in this instance), the results returned are not as expected unless you explicitly use searchFields=Field2.

Query 1 (Returns 0 results):

?searchMode=all&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching

Query 2 (Returns 0 results):

?searchMode=all&searchFields=Field1,Field2&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching

Query 3 (Returns 1 result as expected):

?searchMode=all&searchFields=Field2&search=BA%20(Hons)%20in%20Business%2C%20Organisational%20Behaviour%20and%20Coaching

So why is the expected result returned only with searchFields=Field2, and not when searchFields is omitted or set to Field1,Field2? I would not expect the lack of a match on Field1 to exclude a result that clearly matches on Field2.

Furthermore, removing the "in" and "and" within the search term seems to correct the issue and return the expected result. For example:

Query 4 (Returns 1 result as expected):

?searchMode=all&search=BA%20(Hons)%20Business%2C%20Organisational%20Behaviour%20Coaching

(This is almost like one analyzer is tokenizing the indexed data and a completely different analyzer is tokenizing the search term, although that theory doesn't make any sense when taking into consideration Query 3, as that provides a positive match using the exact same indexed data/search term.)
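
For reference, the four query strings can be regenerated with a short script, which makes it easier to see that they differ only in searchFields and in the two dropped words. This is just a sketch: build_query is a hypothetical helper name, and parentheses are deliberately left unescaped to match the strings shown.

```python
from urllib.parse import quote

TITLE = "BA (Hons) in Business, Organisational Behaviour and Coaching"


def build_query(search, search_fields=None, search_mode="all"):
    """Assemble the query string; hypothetical helper, not part of any SDK."""
    parts = [f"searchMode={search_mode}"]
    if search_fields:
        parts.append("searchFields=" + ",".join(search_fields))
    # safe="()" keeps the parentheses literal, as in the queries above.
    parts.append("search=" + quote(search, safe="()"))
    return "?" + "&".join(parts)


query1 = build_query(TITLE)                               # all searchable fields
query2 = build_query(TITLE, ["Field1", "Field2"])         # both fields, explicitly
query3 = build_query(TITLE, ["Field2"])                   # Field2 only
query4 = build_query("BA (Hons) Business, Organisational Behaviour Coaching")

for q in (query1, query2, query3, query4):
    print(q)
```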

Can anybody shed some light on what's going on here? I'm completely out of ideas and I can't find anything more in the documentation.

NB. Please bear in mind that I'm looking to understand why Azure Search is behaving in this way, and am not necessarily after a workaround.

1 Answer

You get no hits because of how stopwords are handled when you use searchMode=all. The standard analyzer does not remove stopwords, whereas the Lucene and Microsoft analyzers for English do. I verified this by creating an index with your field definitions and sample data. With the standard analyzer, stopwords are kept and you get a match even when using searchMode=all. To get a match with either the Lucene or the Microsoft English analyzer in simple query mode, you would have to use a phrase search.

When you test the en.microsoft analyzer in your example, you only see the output of the first stage of the analyzer, which splits your query into tokens. In your case, two of the tokens are also stopwords in English (in, and). Stopword removal is part of lexical analysis, which is done later in stage 2, as explained in the article called Anatomy of a search request. Furthermore, lexical analysis is only applied to "query types that require complete terms", like searchMode=all. See Exceptions to lexical analysis for more examples.

There is a previous post about this that explains it in more detail. See Queries with stopwords and searchMode=all return no results.
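
To make the interaction easier to picture, here is a toy model in Python. It is not how Azure Search is implemented. It is only a sketch under these assumptions: every query term is analyzed separately with each searched field's analyzer, a field whose analyzer removes the term contributes no clause for it, and with searchMode=all a term that still has at least one clause must match somewhere. The stopword list, the Field1 id value and all function names are invented for illustration, and stemming is ignored.

```python
import re

STOPWORDS = {"in", "and"}  # only the two stopwords from this question


def standard(text):
    """Lowercase and split on non-alphanumerics; stopwords are kept."""
    return re.findall(r"[a-z0-9]+", text.lower())


def english(text):
    """Like standard(), but stopwords are dropped (no stemming in this toy)."""
    return [t for t in standard(text) if t not in STOPWORDS]


ANALYZERS = {"Field1": standard, "Field2": english}

DOC = {
    "Field1": "ABC123",  # hypothetical id value
    "Field2": "BA (Hons) in Business, Organisational Behaviour and Coaching",
}


def matches_all(query, fields, doc=DOC, analyzers=ANALYZERS):
    """Toy searchMode=all: every surviving query term must match in some field."""
    indexed = {f: set(analyzers[f](doc[f])) for f in fields}
    for term in query.split():
        # One clause per (field, token) that survives that field's analyzer.
        clauses = [(f, tok) for f in fields for tok in analyzers[f](term)]
        if not clauses:
            continue  # removed in every searched field, so it constrains nothing
        if not any(tok in indexed[f] for f, tok in clauses):
            return False  # e.g. "in" survives for Field1 only and is not in the id
    return True


full = DOC["Field2"]
print(matches_all(full, ["Field1", "Field2"]))  # Queries 1 and 2
print(matches_all(full, ["Field2"]))            # Query 3
print(matches_all("BA (Hons) Business, Organisational Behaviour Coaching",
                  ["Field1", "Field2"]))        # Query 4
```

In this model, "in" and "and" disappear entirely when only Field2 is searched, but remain required terms for Field1 once that field is included, which mirrors the pattern of results in the question.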

I know you did not ask for workarounds, but to better understand what goes on it could be useful to list some possible workarounds.

  • For English analyzers, use phrase search by wrapping the query in quotes: search="BA (Hons) in Business, Organisational Behaviour and Coaching"&searchMode=all
  • The standard analyzer works the way you expect: search=BA (Hons) in Business, Organisational Behaviour and Coaching&searchMode=all
  • Disable lexical analysis by defining a custom analyzer.
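
As a small illustration of the first workaround, the phrase query is the same search text wrapped in double quotes before URL-encoding. A minimal sketch, assuming the text itself contains no double quotes; phrase_query is a hypothetical helper name.

```python
from urllib.parse import quote


def phrase_query(text, search_mode="all"):
    """Wrap the text in double quotes so it is sent as a single phrase.

    Assumes text contains no double quotes of its own.
    """
    phrase = f'"{text}"'
    # safe="()" leaves parentheses literal; quotes become %22, spaces %20.
    return f"?searchMode={search_mode}&search={quote(phrase, safe='()')}"


phrase_url = phrase_query(
    "BA (Hons) in Business, Organisational Behaviour and Coaching"
)
print(phrase_url)
```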