I have an Azure Storage container which contains a mix of files (pdf, doc, docx, jpg, png, ...) stored as blobs.
I'm trying to use the Azure Search blob indexer to index the metadata for all files (including images) and, where possible, extract the content for full-text search (images obviously have no extractable text content). I want an entry in the search index for each image because I have additional data in DocumentDB that I want to merge into the search index myself using a WebJob.
Using the Azure Portal I have added the data source, index and indexer. However, when the indexer runs, it fails with the following error:
Document 'https://xxx.blob.core.windows.net/xxx/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-v1' has unsupported content type 'image/jpeg'
The documentation at https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#using-custom-metadata-to-control-document-extraction says that if I add metadata to the blob with a key of "AzureSearch_SkipContent" and a value of "true", the indexer should not attempt to extract content.
I added the "AzureSearch_SkipContent" metadata to every blob whose content type is not listed in the table at https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#content-type-specific-metadata-properties , but the indexer still fails with the error above.
If I add "AzureSearch_Skip" metadata set to "true" then the indexer does skip the image blob, but then I don't have anything in the index for it - which is not what I want.
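To make the tagging rule concrete, here is a minimal sketch of the decision described above: blobs whose content type supports extraction get no extra metadata, and everything else gets "AzureSearch_SkipContent" set to "true". The function name `indexer_metadata` and the `EXTRACTABLE_TYPES` set are hypothetical; the set is only a placeholder for the table in the documentation, not a copy of it, and actually writing the metadata to the blob is left out.

```python
# Hypothetical placeholder for the content types listed in the documentation
# table. It is NOT the full list; fill it in from the docs.
EXTRACTABLE_TYPES = {
    "application/pdf",
    "application/msword",
}


def indexer_metadata(content_type, extractable=EXTRACTABLE_TYPES):
    """Return the custom blob metadata to set for a given content type.

    Types in `extractable` need no flag; anything else (e.g. images)
    is flagged so the indexer should skip content extraction.
    """
    if content_type.lower() in extractable:
        return {}
    return {"AzureSearch_SkipContent": "true"}


jpeg_metadata = indexer_metadata("image/jpeg")
pdf_metadata = indexer_metadata("application/pdf")
```

With this rule, `jpeg_metadata` carries the skip-content flag and `pdf_metadata` is empty, which matches what I applied to the blobs before re-running the indexer.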
Here is an example of the steps I'm trying to achieve:
- An image of (for example) a fire extinguisher is saved to blob storage
- At the same time I store in DocumentDB some extra information about the fire extinguisher
- I want the blob indexer to find the new image and add a row to the search index for the new blob, without trying to extract any text content
- A custom WebJob will update the new row in the search index with information from the related DocumentDB document
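The last step above could look roughly like the sketch below, which only builds the request body for the Azure Search "Add, Update or Delete Documents" REST call and does not send it. The key field name `id`, the key value, and the `description` field are assumptions for illustration; the real key has to match whatever the blob indexer wrote for that blob.

```python
import json


def build_merge_batch(key_field, key_value, extra_fields):
    """Build an indexing batch that merges extra fields into one document.

    "merge" only updates a document that already exists in the index,
    so it relies on the blob indexer having created the entry first.
    "mergeOrUpload" would create the document if it is missing.
    """
    doc = {"@search.action": "merge", key_field: key_value}
    doc.update(extra_fields)
    return {"value": [doc]}


# Hypothetical key and field values, standing in for the DocumentDB data.
batch = build_merge_batch(
    "id",
    "blob-key-123",
    {"description": "Fire extinguisher"},
)

# JSON body to POST to the index's docs/index endpoint.
body = json.dumps(batch)
```

If the indexer skips the image entirely, a plain "merge" has nothing to update, which is why I want the indexer to create the entry in the first place.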
So, is it possible to add "AzureSearch_SkipContent" to an image blob and still have an entry appear in the search index for it? Or is my only option to skip it completely with "AzureSearch_Skip" and then manually add an entry to the search index for it?