I'm having an issue using Azure Search with the following example data set: abc-123-456, abc-123-457, abc-123-458, etc. When I search for abc-123-456, I expect only one result, but instead I get every result containing abc-123-... Is there a setting or some other way to change this behavior?

Current search settings:

TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
    Side = EdgeNGramTokenFilterSide.Front,
    MinGram = 3,
    MaxGram = 20
});

TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
    TokenFilters =
    {
        TokenFilterName.Lowercase,
        new TokenFilterName("frontEdgeNGram"),
        TokenFilterName.Classic,
        TokenFilterName.AsciiFolding
    }
});

SearchOptions UsersSearchOptions = new SearchOptions
{
    QueryType = SearchQueryType.Simple,
    SearchMode = SearchMode.All,
};

Using Azure.Search.Documents version 11.1.1.

Edit: Searching for abc-123-456* (with the trailing asterisk) gives me the one result, as expected. How can I get this behavior by default?

Just to add to this:

The portal version is 2020-06-30. The SDK version we use is Azure.Search.Documents 11.1.1.

  1. abc-123-456 does NOT work as expected
  2. "abc-123-456" does NOT work as expected
  3. "abc-123-456"* does NOT work
  4. "abc-123-456*" does NOT work

If we append an asterisk to the end of the search text, outside of any phrase, it works as expected. For example, abc-123-456* works as expected, and so does (abc-123-456* | abc-123-457* ).

Why is the asterisk required? How can we make this work within a phrase?

1 Answer

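This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text "abc-123-456" is broken into smaller tokens such as "abc", "abc-", "abc-1", "abc-12", "abc-123", and so on up to "abc-123-456". Because the same analyzer is also applied to the query text, the query produces these same short tokens, and they match every document that shares a prefix with it. Use the Analyze API to see the full list of tokens generated by a particular analyzer.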
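To make the overlap concrete, here is a minimal local sketch that imitates the question's analyzer (whitespace split, lowercase, front edge n-grams with MinGram = 3 and MaxGram = 20). It does not call Azure Search; the class and method names are made up for illustration, and the real service may differ in details such as the Classic and AsciiFolding filters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that approximates the custom analyzer from the question.
public static class EdgeNGramDemo
{
    // Front edge n-grams of a single token, lowercased first.
    public static IEnumerable<string> EdgeNGrams(string token, int minGram, int maxGram)
    {
        string lower = token.ToLowerInvariant();
        int longest = Math.Min(maxGram, lower.Length);
        for (int len = minGram; len <= longest; len++)
        {
            yield return lower.Substring(0, len);
        }
    }

    // Whitespace split followed by lowercase + front edge n-gram expansion.
    public static HashSet<string> Analyze(string text, int minGram = 3, int maxGram = 20)
    {
        var separators = new[] { ' ', '\t', '\r', '\n' };
        var tokens = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        return new HashSet<string>(tokens.SelectMany(t => EdgeNGrams(t, minGram, maxGram)));
    }

    public static void Main()
    {
        var indexed = Analyze("abc-123-457"); // tokens stored for a different document
        var query = Analyze("abc-123-456");   // tokens produced from the query text

        // Tokens that the query and the other document have in common.
        var shared = query.Intersect(indexed).OrderBy(t => t.Length).ToList();
        Console.WriteLine(string.Join(", ", shared));
        // prints: abc, abc-, abc-1, abc-12, abc-123, abc-123-, abc-123-4, abc-123-45
    }
}
```

Eight of the nine query tokens also exist for abc-123-457, which is consistent with the neighbouring documents coming back as matches.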

For a query such as abc-123, if the default analyzer is being used, the query terms will be abc and 123, and the query will match all documents that contain these terms.

The prefix query, on the other hand, is not analyzed and looks for documents that contain the prefix as is: "abc-123". A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result comes back. Full-text search runs over tokens in inverted indexes. Everything else (filters, fuzzy, regex, prefix/wildcard, etc.) runs over verbatim strings in a separate unprocessed/internal index.

Another option is to set only the search analyzer on the field to keyword, so that the input query is not broken into smaller tokens.
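A rough sketch of that field configuration with Azure.Search.Documents 11.x follows. The field name "code" is hypothetical, and "FrontEdgeNGram" is the custom analyzer from the question. As far as I know, the index and search analyzers have to be set together, with the general AnalyzerName left unset, and changing the index analyzer of an existing field usually means rebuilding the index.

```csharp
using Azure.Search.Documents.Indexes.Models;

// Hypothetical field name; adjust to the field that holds values like abc-123-456.
var codeField = new SearchableField("code")
{
    // Index time: keep generating edge n-grams so partial input can still match.
    IndexAnalyzerName = new LexicalAnalyzerName("FrontEdgeNGram"),
    // Query time: keep each search term as a single, unmodified token.
    SearchAnalyzerName = LexicalAnalyzerName.Keyword
};
```

One caveat: the keyword analyzer does not lowercase, while the index-side analyzer does, so a query such as ABC-123-456 may stop matching. If that matters, a custom search analyzer that lowercases but omits the edge n-gram filter is probably a better fit.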