0
votes

I'm trying to find out whether, and how, I can index binary data (mostly MS Office documents and PDFs) that resides not in Azure Blob Storage but in non-Azure data sources.

The closest example I found copies the files to an Azure blob container and then adds a skillset to index the docs from there.

I would like to bypass the Azure blob container, and push the doc metadata as well as the binary content directly.

Any advice or examples I can look at?

Thanks


2 Answers

2
votes

You can define a skillset that combines custom and built-in skills when you push data to the index. There is a built-in Document Extraction skill that does what you want: it extracts content from a file inside the enrichment pipeline. See:

https://docs.microsoft.com/en-us/azure/search/cognitive-search-skill-document-extraction
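
To give a rough idea of what that looks like, here is a minimal sketch in Python that builds the JSON for a Document Extraction skill, plus the file reference object it takes as input. The property names (`file_data`, `$type`, `data`, `parsingMode`, `dataToExtract`) are written from my reading of the page linked above, so please check them against it. The helper names, the `/document/file_data` source path, and the `extracted_content` target name are my own assumptions for illustration. How the skillset is attached and invoked is not shown here.

```python
import base64
import json


def make_file_reference(raw_bytes: bytes) -> dict:
    """Wrap raw file bytes in a file reference object.

    Assumption: the skill accepts an object with "$type": "file" and the
    file content as a base64 string under "data" (verify against the docs).
    """
    return {
        "$type": "file",
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


def make_document_extraction_skill(source: str = "/document/file_data") -> dict:
    """Build a skill definition for the built-in Document Extraction skill.

    The source path and target name are hypothetical; adjust them to
    wherever your enrichment tree actually holds the file reference.
    """
    return {
        "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
        "context": "/document",
        "parsingMode": "default",
        "dataToExtract": "contentAndMetadata",
        "inputs": [{"name": "file_data", "source": source}],
        "outputs": [{"name": "content", "targetName": "extracted_content"}],
    }


if __name__ == "__main__":
    # Print both JSON fragments so they can be compared with the docs.
    print(json.dumps(make_document_extraction_skill(), indent=2))
    print(json.dumps(make_file_reference(b"%PDF-1.4 sample bytes"), indent=2))
```

The point of the file reference is that the binary content travels as base64 inside the document itself, so the skill does not need to read it from a blob container.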

0
votes

I would like to bypass the Azure blob container, and push the doc metadata as well as the binary content directly.

As per the documentation available here, I don't think it is possible to keep your data outside of Azure. Your data must reside in an Azure data source that an indexer can access, which as of today means one of Azure Blob Storage, Azure Table Storage, Azure SQL Database, or Azure Cosmos DB.
