We're using Azure Cognitive Search as our search engine for searching for images. The analyzer is Lucene standard and when a user searches for "scottish landscapes" some of our users claim that their image is missing. They will then have to add the keyword "landscapes" in their images so that the search engine can find them.
Changing the analyzer to "en-lucene" or "en-microsoft" only seemed to have way smaller search results, which we didn't like for our users.
Azure Cognitive Search does not seem to distinguish singular and plural words. To resolve the issue, I created a dictionary in the database, used inflection and tried manipulating the search terms:
foreach (var term in terms)
{
if (ps.IsSingular(term))
{
// check with db
var singular = noun.GetSingularWord(term);
if (!string.IsNullOrEmpty(singular))
{
var plural = ps.Pluralize(term);
keywords = keywords + " " + plural;
}
}
else
{
// check with db
var plural = noun.GetPluralWord(term);
if (!string.IsNullOrEmpty(plural))
{
var singular = ps.Singularize(term);
keywords = keywords + " " + singular;
}
}
}
My solution is not 100% ideal but it would be nicer if Azure Cognitive Search can distinguish singular and plural words.
UPDATE: Custom Analyzers may be the answer to my problem, I just need to find the right token filters.
UPDATE: Below is my custom analyzer. It removes html constructs, apostrophes, stopwords and converts them to lowercase. The tokenizer is MicrosoftLanguageStemmingTokenizer and it reduces the words to its root words so it's apt for plural to singular scenario (searching for "landscapes" returns "landscapes" and "landscape")
"analyzers": [
{
"name": "p4m_custom_analyzer",
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"charFilters": [
"html_strip",
"remove_apostrophe"
],
"tokenizer": "custom_tokenizer",
"tokenFilters": [
"lowercase",
"remove_stopwords"
]
}
],
"charFilters": [
{
"name": "remove_apostrophe",
"@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
"mappings": ["'=>"]
}
],
"tokenizers": [
{
"name": "custom_tokenizer",
"@odata.type":"#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
"isSearchTokenizer": "false"
}
],
"tokenFilters": [
{
"name": "remove_stopwords",
"@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter"
}
]
I have yet to figure out the other way around. If the user searches for "apple" it should return "apple" and "apples".