Indexing HTML content from Azure database with Azure Search

Question

We store CMS content in our Azure database and need to index some of its HTML content.

What are best practices for indexing this in Azure Search, such that it only indexes the content and not the HTML? Or such that the index recognizes it as HTML and ignores the HTML markup?

I know one option would be for me to manipulate it before it gets to the index or on its way, but was hoping there were some built-in capabilities in Azure Search.

Eugene Shvets · Accepted Answer · 2017-01-12T20:47:03

Currently, Azure blob indexer is the only Azure Search indexer that supports parsing HTML in a way that strips HTML markup. Azure SQL indexer treats HTML text just as a chunk of text.

You have several potential options:

Use SQL indexer and accept HTML markup being indexed - depending on your documents, your search quality may still be good.
Pre-process your data and strip the HTML markup, then either put the parsed text back into SQL (and use the SQL indexer) or use the indexing API to push the data into a search index.
Store HTML data in blob storage and use the blob indexer to index HTML data, while continuing to use SQL indexer to index the rest of the data. Multiple indexers can write into the same search index, in effect "assembling" documents from multiple data sources.
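
For the pre-processing option, here is a minimal sketch of stripping markup with Python's standard-library `html.parser` before the text is written back to SQL or pushed to the index. The helper names (`strip_html`, `to_search_document`) and the field names `id` and `content` are illustrative assumptions, not part of Azure Search; adapt them to your own index schema. It makes no calls to Azure.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not run
        # together, then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment, without markup."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_search_document(row_id, html):
    """Build a dict to push to the index.

    The field names "id" and "content" are hypothetical; use the
    fields defined in your own index.
    """
    return {"id": str(row_id), "content": strip_html(html)}
```

Joining text chunks with a space keeps neighbouring paragraphs from merging into one token, at the cost of occasionally splitting a word that is only partly wrapped in an inline tag (for example `<b>Hel</b>lo`). For full-text search that trade-off is usually acceptable.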
