I have an Azure Storage container which contains a mix of files (pdf, doc, docx, jpg, png, ...) stored as blobs.

I'm trying to use the Azure Search blob indexer to index the metadata for all files (including images) and, where possible, extract the content for full-text searching (obviously images don't have any extractable text content). I want the image metadata indexed so that each image has an entry in the search index, because I have additional data in DocumentDB that I want to merge manually into the search index using a WebJob.

Using the Azure Portal I have added the data source, index and indexer. However, when the indexer runs, it fails with the following error:

Document 'https://xxx.blob.core.windows.net/xxx/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-v1' has unsupported content type 'image/jpeg'

The documentation at https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#using-custom-metadata-to-control-document-extraction says that if I add metadata to the blob with a key of "AzureSearch_SkipContent" and a value of "true", the indexer should not attempt to extract content.

After adding the "AzureSearch_SkipContent" metadata to all content types not listed in the table on https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#content-type-specific-metadata-properties , the indexer is still failing with the error above.

If I add "AzureSearch_Skip" metadata set to "true" then the indexer does skip the image blob, but then I don't have anything in the index for it - which is not what I want.

Here is an example of the steps I'm trying to achieve:

  • An image of (for example) a fire extinguisher is saved to blob storage
  • At the same time I store in DocumentDB some extra information about the fire extinguisher
  • I want the blob indexer to find the new image and add a row to the search index for the new blob, without trying to extract any text content
  • A custom WebJob will update the new row in the search index with information from the related DocumentDB document
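
For illustration, the last step above could be sketched as follows. This is a minimal sketch, not tested against a live service: it only builds the HTTP request for the "merge" action of the index documents endpoint and does not send it. The key field name "id" and the extra field "description" are hypothetical placeholders; the real key field and key value depend on how the index and indexer are defined.

```python
import json
import urllib.request

API_VERSION = "2016-09-01"  # same API version as used elsewhere in this thread


def build_merge_request(service, index, api_key, doc_key, extra_fields, key_field="id"):
    """Build (but do not send) a request that merges extra fields into an
    existing search document. key_field is a hypothetical key field name."""
    # "merge" updates only the listed fields of a document that already exists.
    action = {"@search.action": "merge", key_field: doc_key}
    action.update(extra_fields)
    body = json.dumps({"value": [action]}).encode("utf-8")
    url = "https://{0}.search.windows.net/indexes/{1}/docs/index?api-version={2}".format(
        service, index, API_VERSION
    )
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Example usage with placeholder values; pass the result to
# urllib.request.urlopen() to actually send it.
request = build_merge_request(
    "my-service", "my-index", "my-admin-key",
    doc_key="blob-document-key",
    extra_fields={"description": "Fire extinguisher"},
)
```

The "merge" action is used here because the row is expected to exist already (created by the blob indexer); it fails for a key that is not in the index.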

So, should it be possible to add "AzureSearch_SkipContent" to an image blob and have something appear in the search index for it? Or is my only solution to "AzureSearch_Skip" it completely and then manually add something in to the search index for it?

1 Answer

The AzureSearch_SkipContent flag only works for supported content types, where Azure Search can extract content-type-specific metadata.

Azure Search also supports indexing only the storage metadata, skipping both content-type-specific metadata and content extraction. In that mode the content type doesn't matter. However, this setting is only available at the indexer scope and applies to all blobs. See "Index storage metadata only" in the documentation.

We've heard similar questions from several customers, so we're adding another switch that will behave as follows:

  1. Blobs with supported content types will be fully indexed (respecting per-blob flags, of course)
  2. For blobs with unsupported content types, Azure Search will index storage metadata and not fail on those blobs like it does today.

It looks like this will be helpful in your case.

UPDATE on Dec 7, 2016: This functionality is now available. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]

{
 ... other parts of indexer definition
 "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
} 
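
The same update could also be scripted. The following is a minimal sketch, not tested against a live service: it takes an existing indexer definition as a dict, sets the flag without touching the other settings, and builds the PUT request without sending it. The function names and the sample definition values are hypothetical.

```python
import copy
import json
import urllib.request

API_VERSION = "2016-09-01"


def with_unsupported_types_allowed(indexer_definition):
    """Return a copy of the definition with failOnUnsupportedContentType set to false."""
    updated = copy.deepcopy(indexer_definition)
    # Tolerate a missing or null "parameters" / "configuration" section.
    params = updated.get("parameters") or {}
    config = params.get("configuration") or {}
    config["failOnUnsupportedContentType"] = False
    params["configuration"] = config
    updated["parameters"] = params
    return updated


def build_update_request(service, indexer_name, api_key, indexer_definition):
    """Build (but do not send) the PUT request that updates the indexer."""
    body = json.dumps(with_unsupported_types_allowed(indexer_definition)).encode("utf-8")
    url = "https://{0}.search.windows.net/indexers/{1}?api-version={2}".format(
        service, indexer_name, API_VERSION
    )
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Example usage with placeholder values; the PUT body must carry the full
# indexer definition, not just the new parameter.
definition = {
    "name": "blob-indexer",
    "dataSourceName": "blob-datasource",
    "targetIndexName": "my-index",
}
request = build_update_request("my-service", "blob-indexer", "my-admin-key", definition)
```

Working on a copy keeps the original definition unchanged, so it can be reused or compared against the updated one.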

For more info, see "Controlling which blobs are indexed" in the documentation.