
We're using Azure Cognitive Search as the search engine for our image search. The analyzer is the standard Lucene analyzer, and when a user searches for "scottish landscapes", some of our users report that their images are missing from the results. They then have to add the keyword "landscapes" to their images so that the search engine can find them.

Changing the analyzer to "en.lucene" or "en.microsoft" seemed to return far fewer results, which we didn't want for our users.

Azure Cognitive Search does not seem to match singular and plural forms of the same word with the standard analyzer. To work around this, I created a noun dictionary in the database, used an inflection library, and manipulated the search terms:

// ps is the pluralization/inflection helper; noun is the database-backed noun dictionary.
foreach (var term in terms)
{
    if (ps.IsSingular(term))
    {
        // The term is singular; if the dictionary knows it, also search for its plural form.
        var singular = noun.GetSingularWord(term);
        if (!string.IsNullOrEmpty(singular))
        {
            var plural = ps.Pluralize(term);
            keywords = keywords + " " + plural;
        }
    }
    else
    {
        // The term is plural; if the dictionary knows it, also search for its singular form.
        var plural = noun.GetPluralWord(term);
        if (!string.IsNullOrEmpty(plural))
        {
            var singular = ps.Singularize(term);
            keywords = keywords + " " + singular;
        }
    }
}

My solution is not 100% ideal, but it would be nicer if Azure Cognitive Search could match singular and plural words on its own.

UPDATE: Custom analyzers may be the answer to my problem; I just need to find the right token filters.

UPDATE: Below is my custom analyzer. It strips HTML constructs and apostrophes, removes stopwords, and lowercases the tokens. The tokenizer is MicrosoftLanguageStemmingTokenizer, which reduces words to their root forms, so it fits the plural-to-singular scenario (searching for "landscapes" returns both "landscapes" and "landscape").

"analyzers": [      
      {
          "name": "p4m_custom_analyzer",
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "charFilters": [
              "html_strip",              
              "remove_apostrophe"              
          ],
          "tokenizer": "custom_tokenizer",
          "tokenFilters": [
              "lowercase",
              "remove_stopwords"                                                                     
          ]
      }
  ],
  "charFilters": [          
      {
          "name": "remove_apostrophe",
          "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
          "mappings": ["'=>"]
      }
  ],
  "tokenizers": [
      {
          "name": "custom_tokenizer",
          "@odata.type":"#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
          "isSearchTokenizer": "false"          
      }
  ],
  "tokenFilters": [      
      {
          "name": "remove_stopwords",
          "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter"          
      }     
  ]

I have yet to figure out the other way around: if the user searches for "apple", it should return both "apple" and "apples".
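
One idea I have not tried yet, so treat it as a sketch only: keep the stemming analyzer for indexing and add a second custom analyzer for query time whose tokenizer has isSearchTokenizer set to true, wired up through the field's indexAnalyzer and searchAnalyzer properties. With stemming applied on both sides, "apple" and "apples" should normalize to the same term. The names p4m_search_analyzer, custom_search_tokenizer and the keywords field below are placeholders made up for illustration:

"analyzers": [
    {
        "name": "p4m_search_analyzer",
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "charFilters": [
            "html_strip",
            "remove_apostrophe"
        ],
        "tokenizer": "custom_search_tokenizer",
        "tokenFilters": [
            "lowercase",
            "remove_stopwords"
        ]
    }
],
"tokenizers": [
    {
        "name": "custom_search_tokenizer",
        "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
        "isSearchTokenizer": true
    }
],
"fields": [
    {
        "name": "keywords",
        "type": "Edm.String",
        "searchable": true,
        "indexAnalyzer": "p4m_custom_analyzer",
        "searchAnalyzer": "p4m_search_analyzer"
    }
]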


1 Answer


Both en.lucene and en.microsoft should have helped with this; you shouldn't need to manually expand inflections on your side. I'm surprised to hear you see lower recall with them. Generally speaking, I would expect higher recall with those than with the standard analyzer. Do you by any chance have multiple searchable fields with different analyzers? That could interfere. Otherwise, it would be great to see a specific case (a query/document pair along with the index definition) to investigate further.

As a quick test, I used this small index definition:

{
    "name": "inflections",
    "fields": [
        {
            "name": "id",
            "type": "Edm.String",
            "searchable": false,
            "filterable": true,
            "retrievable": true,
            "sortable": false,
            "facetable": false,
            "key": true
        },
        {
            "name": "en_ms",
            "type": "Edm.String",
            "searchable": true,
            "filterable": false,
            "retrievable": true,
            "sortable": false,
            "facetable": false,
            "key": false,
            "analyzer": "en.microsoft"
        }
    ]
}

These docs:

{
    "id": "1",
    "en_ms": "example with scottish landscape as part of the sentence"
},
{
    "id": "2",
    "en_ms": "this doc has one apple word"
},
{
    "id": "3",
    "en_ms": "this doc has two apples in it"
}
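
For reference, docs like these can be pushed into the index with the documents index REST call, roughly as follows (service name, api key and api-version are placeholders):

POST https://[service-name].search.windows.net/indexes/inflections/docs/index?api-version=[api-version]
Content-Type: application/json
api-key: [admin-key]

{
    "value": [
        { "@search.action": "upload", "id": "1", "en_ms": "example with scottish landscape as part of the sentence" },
        { "@search.action": "upload", "id": "2", "en_ms": "this doc has one apple word" },
        { "@search.action": "upload", "id": "3", "en_ms": "this doc has two apples in it" }
    ]
}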

For the query search=landscapes I see these results:

{
    "value": [
        {
            "@search.score": 0.9631388,
            "id": "1",
            "en_ms": "example with scottish landscape as part of the sentence"
        }
    ]
}

And for search=apple I see:

{
    "value": [
        {
            "@search.score": 0.51188517,
            "id": "3",
            "en_ms": "this doc has two apples in it"
        },
        {
            "@search.score": 0.46152657,
            "id": "2",
            "en_ms": "this doc has one apple word"
        }
    ]
}
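
In case you want to reproduce this, both queries are plain keyword searches against the index; the corresponding REST call looks roughly like this (service name, api key and api-version are placeholders):

GET https://[service-name].search.windows.net/indexes/inflections/docs?api-version=[api-version]&search=landscapes
api-key: [query-key]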