File Format Detection in Azure Search

Question

We have a very large number of blobs in Azure that we would like to add to an Azure Search index. These blobs come in a variety of formats (PDF, DOC, RTF, etc.), but none of them have file extensions.

Because of this, Azure Search balks during indexing as it appears to only use the file extension to do file format detection. We get the following error, and since all of our files have these "invalid" extensions, it would happen regardless of any threshold set for indexing errors:

Import configuration failed, error creating Indexer: "Error with data source: Document 'https://XXXXXXX.blob.core.windows.net/folder/filename.00001' has unsupported content type 'unsupported'. To index only the blob metadata and ignore its content, set the 'dataToExtract' indexer configuration property to 'storageMetadata'. See https://aka.ms/azsearchblobdatatoextract. To ignore this error and continue indexing blobs with unsupported content types, set the 'failOnUnsupportedContentType' switch in indexer configuration to false. For more information, see https://aka.ms/blob-indexer-parameters-for-extraction. Please adjust your data source definition in order to proceed."

Is there any way to have Azure Search either do content-based file format detection, or at least use metadata on the blob?

Eugene Shvets · Accepted Answer · 2019-05-10T14:20:59

Azure Search already does content-based content type detection, but some blobs are problematic. These problematic blobs can be skipped during indexer operation (with a warning so you know what happened), but if such a blob is encountered during indexer creation, the creation fails with the error you encountered.

If you remove the blob in question (or skip it using blob metadata), do most of your other blobs work as expected? I suspect the Azure Search team would be interested in taking a look at the problematic blob if it would be possible for you to share it.
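
For reference, the error message in the question names two indexer configuration properties, 'failOnUnsupportedContentType' and 'dataToExtract'. The sketch below shows roughly where they would sit in an indexer definition sent to the REST API. It only builds and prints the JSON body and does not call the service. The helper name and the indexer, data source, and index names are hypothetical placeholders, and whether these settings avoid the creation-time failure described above is not confirmed in this thread.

```python
import json


def build_indexer_definition(name, data_source_name, target_index_name,
                             skip_unsupported=True, metadata_only=False):
    """Build a JSON body for a blob indexer definition (hypothetical helper)."""
    configuration = {
        # Named in the error message: when false, blobs with unsupported
        # content types are meant to be skipped instead of failing the run.
        "failOnUnsupportedContentType": not skip_unsupported,
    }
    if metadata_only:
        # Also named in the error message: index blob metadata only and
        # ignore the blob content entirely.
        configuration["dataToExtract"] = "storageMetadata"
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": target_index_name,
        "parameters": {"configuration": configuration},
    }


# Placeholder names; substitute the ones used in your own search service.
definition = build_indexer_definition(
    "blob-indexer", "blob-datasource", "blob-index"
)
print(json.dumps(definition, indent=2))
```

Passing metadata_only=True adds the 'dataToExtract' setting, which gives up content extraction altogether, so it is only useful if searching on blob metadata is enough.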
