
We want to use the language-specific analyzers provided by Azure Search, but add the html_strip char filter from Lucene. Our idea was to build a custom analyzer that uses the same components (tokenizer, filters) as, for example, the en.microsoft analyzer, plus the additional char filter.

Sadly, we can't find any documentation on what exactly constitutes the en.microsoft analyzer, or any other Microsoft analyzer, so we don't know which tokenizer and filters to use to get the same result with a custom analyzer.

Can anyone point us to the right documentation?

The documentation says that the en.microsoft analyzer performs lemmatization instead of stemming, but I can't find any tokenizer or filter that claims to do lemmatization, only stemmers.

Whoever has been voting to close: this is a relevant and well-formed question about how to programmatically interact with Azure Search. Please don't close it. @samy I don't know the answer offhand, but I'll find someone who does. – Bruce Johnston

Thanks @Bruce! I hope they are customizable. – samy

1 Answer


To create a customized version of a Microsoft analyzer, start with the Microsoft tokenizer for the given language (we have a stemming and a non-stemming version) and add token filters from the available set to customize the output token stream. Note that the stemming tokenizer also does lemmatization, depending on the language.
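For example, the stemming variant can be declared as a named tokenizer in the index definition along these lines (a sketch against the REST API; my_en_tokenizer is a placeholder name, and the non-stemming counterpart would use the #Microsoft.Azure.Search.MicrosoftLanguageTokenizer type instead):

```json
{
  "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
  "name": "my_en_tokenizer",
  "language": "english"
}
```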

In most cases, a Microsoft language analyzer is a Microsoft tokenizer plus a stopwords token filter and a lowercase token filter, but this varies by language; in some cases we also do language-specific character normalization.
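Putting those pieces together with the html_strip char filter from the question, an index definition fragment along these lines should approximate en.microsoft plus HTML stripping (a sketch, not the analyzer's exact internal definition; the names en_html_analyzer, my_en_tokenizer, and my_en_stopwords are placeholders):

```json
{
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "en_html_analyzer",
      "charFilters": ["html_strip"],
      "tokenizer": "my_en_tokenizer",
      "tokenFilters": ["my_en_stopwords", "lowercase"]
    }
  ],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "name": "my_en_tokenizer",
      "language": "english"
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
      "name": "my_en_stopwords",
      "stopwordsList": "english"
    }
  ]
}
```

A field then opts in by setting "analyzer": "en_html_analyzer" in its definition.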

We recommend using the above as a starting point. You can then use the Analyze API to test your configuration and see whether it gives you the results you want.
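For instance, posting to https://{service}.search.windows.net/indexes/{index}/analyze?api-version=2020-06-30 (the api-version here is an assumption; use whichever version your service targets) with a body like the following returns the token stream the analyzer produces:

```json
{
  "text": "<p>Dogs were running</p>",
  "analyzer": "en_html_analyzer"
}
```

Comparing that output with the same text run through "analyzer": "en.microsoft" shows how close the custom configuration gets.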