2
votes

We store CMS content in our Azure database and need to index some of the HTML content it holds.

What are the best practices for indexing this in Azure Search, such that it only indexes the content and not the HTML? Or such that the index recognizes it as HTML and ignores the HTML markup?

I know one option would be for me to manipulate it before it gets to the index, or on its way there, but I was hoping there were some built-in capabilities in Azure Search.

3 Answers

0
votes

Currently, the Azure blob indexer is the only Azure Search indexer that parses HTML in a way that strips the markup. The Azure SQL indexer treats HTML as an ordinary chunk of text, markup included.

You have several potential options:

  1. Use SQL indexer and accept HTML markup being indexed - depending on your documents, your search quality may still be good.
  2. Pre-process your data and strip the HTML markup, then either put the parsed text back into SQL (and use SQL indexer), or use the indexing API to push the data into a search index.
  3. Store HTML data in blob storage and use the blob indexer to index HTML data, while continuing to use SQL indexer to index the rest of the data. Multiple indexers can write into the same search index, in effect "assembling" documents from multiple data sources.
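
As a rough illustration of option 2, here is a minimal sketch of the stripping step using only Python's standard library. The `strip_html` helper and the `_TextExtractor` class are hypothetical names, not part of any Azure SDK, and writing the result back to SQL or pushing it to the index is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so words in adjacent elements do not fuse,
        # then collapse runs of whitespace. A tag in the middle of a word
        # therefore splits that word, which is usually acceptable for search.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b> &amp; friends</p>"))
```

The cleaned string can then be stored in a separate column or sent to the index, so the original HTML stays available for rendering.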

0
votes

You could try with a Custom Analyzer with a custom Char Filter.

Char Filters can be used to "clean" the input with either a mapping or a pattern replace (Regular Expression).

The pattern replace option uses the PatternReplaceCharFilter internally.

Please keep in mind that complex expressions will probably lead to longer indexing times.
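
A minimal sketch of what such a pattern replace char filter definition might look like, built as a Python dict so it can be serialized into the index JSON. The filter name `strip_tags` and the regular expression are assumptions for illustration. The service evaluates the pattern with its own regex engine, so the `preview` helper below, which uses Python's `re`, only approximates the result.

```python
import json
import re

# Assumed pattern: anything between "<" and the next ">" is treated as a tag.
# This is naive and will misfire on a literal "<" in text or ">" inside attributes.
TAG_PATTERN = r"<[^>]+>"

char_filter = {
    "name": "strip_tags",  # hypothetical filter name
    "@odata.type": "#Microsoft.Azure.Search.PatternReplaceCharFilter",
    "pattern": TAG_PATTERN,
    "replacement": " ",  # a space keeps adjacent words from fusing
}


def preview(text):
    """Local approximation of the filter's effect (hypothetical helper)."""
    return re.sub(TAG_PATTERN, " ", text)


if __name__ == "__main__":
    print(json.dumps(char_filter, indent=2))
    print(preview("<p>Hi <b>there</b></p>").split())
```

The definition would go in the index's `charFilters` collection, and a custom analyzer would then refer to it by name.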

0
votes

I'm using a custom analyzer like this one to index HTML. I don't know if it's the best way.

    {
      "name": "bodyHtml",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase", "asciifolding"
      ],
      "charFilters": [
        "html_strip"
      ]
    }
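
For context, here is a sketch of how an analyzer like this might be attached to a field in the index definition, again written as a Python dict that serializes to the JSON body. The index name `cms-content` and the field names `id` and `body` are assumptions, and the request that would send this to the service is omitted.

```python
import json

# The analyzer from the answer above, unchanged.
body_html_analyzer = {
    "name": "bodyHtml",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "standard_v2",
    "tokenFilters": ["lowercase", "asciifolding"],
    "charFilters": ["html_strip"],
}

index_definition = {
    "name": "cms-content",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},  # hypothetical key field
        {
            "name": "body",  # hypothetical field holding the HTML
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "bodyHtml",  # must match the analyzer's name
        },
    ],
    "analyzers": [body_html_analyzer],
}

if __name__ == "__main__":
    print(json.dumps(index_definition, indent=2))
```

Note that an analyzer only changes the tokens that get indexed. The stored field value still contains the markup, so results returned from that field would most likely still need stripping before display.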