1
votes

We are trying to enable full-text search. Our application stores PDF files in Azure Blob Storage, which is the data source for Azure Search. Most of this works fine; however, the indexer is not able to extract text from a couple of the PDFs. Are there specific kinds of PDFs that the Azure Search indexer can extract text from? If so, what are they?

Any information or help in this regard would be greatly appreciated.


3 Answers

1
votes

Azure Search can extract all text from PDF text elements. Extracting text from embedded images (which requires OCR) or tables is not yet integrated in Azure Search, but it is on the roadmap.

If your PDFs contain images and you want to extract text from those as well, then you can try following the steps here.
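To find out which of your PDFs fall into the image-only category before indexing, you could check whether each page has any extractable text at all. This is a minimal sketch, not something Azure Search provides: the helper names `pages_without_text` and `scan_pdf` are made up for illustration, and `scan_pdf` assumes the third-party `pypdf` package is installed.

```python
from typing import Iterable, List, Optional


def pages_without_text(page_texts: Iterable[Optional[str]], min_chars: int = 1) -> List[int]:
    """Return 1-based page numbers whose extracted text has fewer than min_chars characters."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def scan_pdf(path: str) -> List[int]:
    """Hypothetical wrapper: list pages of a PDF file that have no text layer.

    Assumption: pypdf is installed (pip install pypdf). It is imported lazily
    so that pages_without_text stays dependency-free.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    return pages_without_text(page.extract_text() for page in reader.pages)
```

If `scan_pdf("some.pdf")` returns every page number, the file is most likely a scan and would need OCR before the indexer can find any text in it.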

1
votes

Are there any specific kinds of PDFs that Azure Search Indexer can extract?

In my experience, there are no specific kinds of PDFs that the Azure Search indexer can't extract text from. Based on your description, I assume those documents are hitting the Azure Search extraction limit. For more detailed information, please refer to Indexing Documents in Azure Blob Storage with Azure Search.

Azure Search limits how much text it extracts depending on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, and 4 million for Standard, Standard S2 and Standard S3 tiers. A warning is included in the indexer status response for truncated documents.
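As a rough illustration of the limits quoted above, here is a sketch that checks a text against the per-tier limit and picks truncation warnings out of an indexer status response. The limits are the ones from the quote. The function names are hypothetical, and the `lastResult` / `warnings` / `key` / `message` field names are my assumption about the shape of the status JSON, so verify them against a real response from your service.

```python
# Character limits per pricing tier, taken from the documentation quote above.
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard s2": 4_000_000,
    "standard s3": 4_000_000,
}


def exceeds_limit(text: str, tier: str) -> bool:
    """Return True if text is longer than the extraction limit for the given tier."""
    limit = EXTRACTION_LIMITS[tier.strip().lower()]
    return len(text) > limit


def truncation_warnings(status: dict) -> list:
    """Return warnings from the last indexer run that mention truncation.

    Assumption: status is the parsed JSON of the indexer status response and
    holds a "lastResult" object with a "warnings" list of {"key", "message"}.
    """
    last_result = status.get("lastResult") or {}
    return [
        warning
        for warning in last_result.get("warnings", [])
        if "truncat" in warning.get("message", "").lower()
    ]
```

The blob names in the returned warnings' `key` fields would then tell you which PDFs were cut off, assuming the response has that shape.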

0
votes

I recently wrote a blog post about my experience with this. I ended up using a Python-based script running in a Docker container within Azure. It is somewhat complicated, but the blog lays it out pretty clearly, and the results have been very good as far as OCR and searchability go.

http://martyice.github.io/docker-in-azure/