We are now using the Azure Search Microsoft language analyzers on some of our language-specific fields. In most cases, they give better relevance than the standard Lucene language analyzers. However, we found an issue when verifying the en.microsoft analyzer.
The problem occurs when the field value contains digits: the analyzer tolerates redundant “0” characters in front of a number.
For example:
POST /analyze?api-version=2017-11-11
{
  "text": "1",
  "analyzer": "en.microsoft"
}
The response is:
"tokens": [
  {
    "token": "1",
    "startOffset": 0,
    "endOffset": 2,
    "position": 0
  },
  {
    "token": "nn1",
    "startOffset": 0,
    "endOffset": 2,
    "position": 0
  }
]
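For anyone who wants to reproduce this, here is a minimal Python sketch of the same Analyze request using only the standard library. The service name, index name, and API key are placeholders (assumptions, not our real values), and the helper names `build_analyze_request` and `analyze` are made up for this post.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text,
                          analyzer="en.microsoft", api_version="2017-11-11"):
    """Build (but do not send) a POST request for the Analyze API.

    `service`, `index` and `api_key` are placeholders supplied by the caller.
    """
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def analyze(service, index, api_key, text, analyzer="en.microsoft"):
    """Send the request and return only the token strings from the response."""
    req = build_analyze_request(service, index, api_key, text, analyzer)
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]
```

Calling `analyze(...)` with different inputs such as "01" and "0001" makes it easy to compare which tokens each one produces.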
The problem is that if the field value is “01”, then any search text like “01”, “001”, “0001”, … will match that field.
We have a field that stores product attribute name/value pairs, for example, “brand:Contoso|size:1”. Searching for “0001” then returns the document with this field value, which is not what we want.
So my question is: is there any way to customize the en.microsoft analyzer so that we keep its powerful stemmer but avoid the automatic “0” padding in front of digits?
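To illustrate why these values collide, here is a toy Python model of the behavior. It is only my guess at what is happening, not the analyzer's real implementation: I am assuming each run of digits is reduced to the number without leading zeros plus an "nn"-prefixed form, which mirrors the shape of the tokens in the response above. The functions `toy_tokens` and `matches` are hypothetical.

```python
import re


def toy_tokens(text):
    """Toy model (assumption): for every run of digits, emit the digits
    without leading zeros and an 'nn'-prefixed variant of the same value."""
    tokens = set()
    for run in re.findall(r"\d+", text):
        stripped = run.lstrip("0") or "0"  # keep a single "0" for all-zero runs
        tokens.update({stripped, "nn" + stripped})
    return tokens


def matches(query, field_value):
    """A query 'matches' when it shares at least one token with the field."""
    return bool(toy_tokens(query) & toy_tokens(field_value))


# Under this model, differently padded numbers share the same tokens.
print(matches("0001", "01"))
print(matches("0001", "brand:Contoso|size:1"))
```

Under this model the leading zeros are lost before matching, so there is nothing left at query time to tell “0001” apart from “1”.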