
I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.

Some of my PDFs, however, fail in the indexer:

[
    {
        "key": null,
        "errorMessage": "Error processing blob 'https://my-storage.blob.core.windows.net/my-container/mydocument.pdf' with content type '': 422"
    }
]

I double-checked the properties on the blob to make sure its content type was set:

{
    "container": "my-container",
    "name": "mydocument.pdf",
    "metadata": {},
    "lastModified": "Fri, 08 Jul 2016 19:43:15 GMT",
    "etag": "0xXXXXXXXXXXXXXXX",
    "blobType": "BlockBlob",
    "contentLength": "3863790",
    "requestId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "contentSettings": {
        "contentType": "application/pdf",
        "contentMD5": "xxxxxxxxxxxxxxxxxxxxxx=="
    },
    "lease": {
        "status": "unlocked",
        "state": "available"
    }
}

This particular PDF has some security restrictions (no printing), so I thought that might be the cause. To test this, I created some PDFs from scratch, and they were indexed just fine, both with and without the restrictions.

Would it be possible for you to share the problematic PDF with us so we can see whether the problem is on our end? If so, please ping me at eugenesh at the usual Microsoft domain. Thanks! - Eugene Shvets

1 Answer


Azure Search will occasionally encounter documents it cannot handle, for example because of security restrictions or file corruption. There are several indexer settings that control how such files are handled. Please see this answer for details.
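As a rough illustration of the kind of settings involved, here is a minimal Python sketch that builds an indexer definition tolerating a limited number of failed blobs. The `maxFailedItems`, `maxFailedItemsPerBatch`, `failOnUnsupportedContentType`, and `failOnUnprocessableDocument` parameters come from the Azure Search blob indexer documentation, but whether each one is available depends on the API version you target, so treat this as an assumption to verify. The indexer, data source, and index names and the `build_indexer_definition` helper are hypothetical.

```python
import json


def build_indexer_definition(name, data_source, target_index,
                             max_failed_items=10,
                             max_failed_items_per_batch=5):
    """Build the JSON body for a Create/Update Indexer request.

    The thresholds are illustrative defaults, not recommendations.
    """
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
        "parameters": {
            # Let the indexer run succeed even if a few blobs fail.
            "maxFailedItems": max_failed_items,
            "maxFailedItemsPerBatch": max_failed_items_per_batch,
            "configuration": {
                # Skip blobs with an unsupported content type
                # instead of failing on them.
                "failOnUnsupportedContentType": False,
                # Skip blobs that cannot be processed
                # (for example, corrupted or restricted files).
                "failOnUnprocessableDocument": False,
            },
        },
    }


# Hypothetical names; replace them with your own resources.
definition = build_indexer_definition(
    "my-indexer", "my-blob-datasource", "my-index"
)
print(json.dumps(definition, indent=2))
```

The resulting JSON would be sent as the body of the indexer's create or update request. Note that skipped blobs are simply left out of the index, so it is still worth checking the indexer status for the list of failures.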