I am trying to create an index and skillset that will allow me to:

Index PDFs (multi-page and single-page) and all other types of files, extract the data, and make it searchable.

Search for a term, say "Cat", and have the sections of text where the term appears returned, along with the page number and the document name or downloadable URL of the PDF or image where it was found. A bounding box would be nice but is not necessary.

I am struggling. I have tried the text extraction skill and the OCR skill, but searching for a term returns the whole extracted document (100 pages) as text in the "content" field.

It's not making much sense to me, and the JFK example is outdated.

I have spent 4 days on this. It cannot be that difficult, and the documentation is not that helpful either.

I have tried to build an index and skillset using the portal tools, but I am getting a similar result.

Any help would be appreciated.

1 Answer

You might want to try the hOCR custom skill, available on GitHub in the Power Skills repository, if you prefer the hOCR format for bounding boxes. That said, the output of [the OCR skill](https://docs.microsoft.com/en-us/azure/search/cognitive-search-skill-ocr#sample-text-and-layouttext-output) already includes bounding boxes for content. Note that the Power Skills repo also has updated versions of most of the skills used in the JFK sample, including the image store, which can help you make pictures of the pages available in your app.

The key to making it work is in the skillset definition.

The JFK skillset has its OCR skill output layoutText.

There is also a custom image store skill that uploads /document/normalized_images/*/data and keeps the resulting URI as imageStoreUri.

Another custom skill transforms the OCR layout results into the hOCR format.

Then a ShaperSkill aggregates that information under ocrImageMetadata.
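To make the shape of such a skillset more concrete, here is a minimal sketch in Python that builds the definition as a plain dict you could serialize to JSON. It is not the JFK skillset itself: the skillset name, the image store URI parameter, and the custom skill's input name (imageData) are assumptions for illustration, and the hOCR custom skill is left out to keep it short.

```python
# Skill type identifiers used by Azure Cognitive Search skillsets.
OCR = "#Microsoft.Skills.Vision.OcrSkill"
WEB_API = "#Microsoft.Skills.Custom.WebApiSkill"
SHAPER = "#Microsoft.Skills.Util.ShaperSkill"

# One enrichment node per normalized image.
PAGES = "/document/normalized_images/*"


def build_skillset(image_store_uri: str) -> dict:
    """Return a skillset definition as a plain dict (hypothetical names)."""
    ocr = {
        "@odata.type": OCR,
        "context": PAGES,
        "inputs": [{"name": "image", "source": PAGES}],
        # layoutText carries the lines, words, and their bounding boxes.
        "outputs": [{"name": "layoutText", "targetName": "layoutText"}],
    }
    image_store = {
        "@odata.type": WEB_API,
        "uri": image_store_uri,  # assumption: your own deployed custom skill
        "context": PAGES,
        # "imageData" is an assumed input name; use what your skill expects.
        "inputs": [{"name": "imageData", "source": PAGES + "/data"}],
        "outputs": [{"name": "imageStoreUri", "targetName": "imageStoreUri"}],
    }
    shaper = {
        "@odata.type": SHAPER,
        "context": PAGES,
        "inputs": [
            {"name": "layoutText", "source": PAGES + "/layoutText"},
            {"name": "imageStoreUri", "source": PAGES + "/imageStoreUri"},
        ],
        # The shaper's single output is renamed to ocrImageMetadata.
        "outputs": [{"name": "output", "targetName": "ocrImageMetadata"}],
    }
    return {"name": "ocr-per-page-skillset", "skills": [ocr, image_store, shaper]}
```

Calling json.dumps(build_skillset("https://example.invalid/api/image-store"), indent=2) would give a body you could adapt for the skillset definition. The per-image context is what keeps the text, bounding boxes, and image URI grouped together instead of being flattened into one document-wide string.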

In the case of JFK, that information then gets further aggregated under cryptonyms, because that's the main thing the JFK demo focuses on. The image metadata is also exposed through an output field mapping from /document/hocrDocument/metadata to metadata, which is also indexed. The important point is that all the relevant information is mapped to indexed fields. As a consequence, that information becomes available in index query results.
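Once the per-image layoutText is available to your app, getting from a search term to a page number and bounding box is mostly a client-side lookup. The following is a rough sketch, assuming the indexer is configured so that each PDF page yields one normalized image, that you have the layoutText objects for a document as a list in page order, and that each object follows the shape shown in the OCR skill documentation (a "lines" array whose entries have "text" and "boundingBox"). The function name find_term and the sample data are made up for illustration.

```python
from typing import Dict, List


def find_term(pages: List[dict], term: str) -> List[Dict]:
    """Return one hit per OCR line containing `term` (case-insensitive)."""
    needle = term.lower()
    hits = []
    # Assumption: pages[0] is page 1, pages[1] is page 2, and so on.
    for page_number, layout in enumerate(pages, start=1):
        for line in layout.get("lines", []):
            text = line.get("text", "")
            if needle in text.lower():
                hits.append({
                    "page": page_number,
                    "text": text,
                    "boundingBox": line.get("boundingBox", []),
                })
    return hits


def box(x1: int, y1: int, x2: int, y2: int) -> List[Dict[str, int]]:
    """Build a four-corner bounding box for the made-up sample data."""
    return [{"x": x1, "y": y1}, {"x": x2, "y": y1},
            {"x": x2, "y": y2}, {"x": x1, "y": y2}]


# Made-up sample data standing in for three pages of OCR output.
sample_pages = [
    {"lines": [{"text": "The cat sat.", "boundingBox": box(10, 10, 200, 40)}]},
    {"lines": [{"text": "No match here.", "boundingBox": box(10, 10, 220, 40)}]},
    {"lines": [
        {"text": "Dogs only.", "boundingBox": box(10, 10, 150, 40)},
        {"text": "A Cat again.", "boundingBox": box(10, 50, 180, 80)},
    ]},
]

hits = find_term(sample_pages, "Cat")
```

This is a plain substring match, so "cat" would also match "category"; if that matters, match against the "words" array instead. Pairing each hit's page number with the stored image URI for that page gives you the downloadable image and the region to highlight.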