We want to use the language-specific analyzers provided by Azure Search, but add the html_strip char filter (Lucene's HTMLStripCharFilter). Our idea was to build a custom analyzer that uses the same components (tokenizer, filters) as, for example, the en.microsoft analyzer, plus the additional char filter, along the lines of the sketch below.
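For illustration, this is roughly what we have in mind. It's only a sketch: the index name, field names, and analyzer/tokenizer names are made up, and we are guessing that the predefined html_strip char filter, the MicrosoftLanguageStemmingTokenizer, and a lowercase token filter get us close to en.microsoft, which is exactly the part we can't confirm from the docs:

```json
{
  "name": "example-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "body", "type": "Edm.String", "analyzer": "en_custom_html" }
  ],
  "analyzers": [
    {
      "name": "en_custom_html",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [ "html_strip" ],
      "tokenizer": "en_ms_stemming_tokenizer",
      "tokenFilters": [ "lowercase" ]
    }
  ],
  "tokenizers": [
    {
      "name": "en_ms_stemming_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "language": "english",
      "isSearchTokenizer": false
    }
  ]
}
```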
Sadly, we can't find any documentation on what exactly constitutes the en.microsoft analyzer (or any other Microsoft analyzer), so we don't know which tokenizer and filters to use to get the same result with a custom analyzer.
Can anyone point us to the right documentation?
The documentation says that the en.microsoft analyzer performs lemmatization instead of stemming, but we can't find any tokenizer or token filter that claims to use lemmatization, only stemmers.