
I am new to Azure Cognitive Search. I have a docx file stored in Azure Blob Storage, and I am using #Microsoft.Skills.Text.SplitSkill to split the document into multiple pages (chunks). But when I index the output of this skill, I get the entire docx file content. How do I return the "pages" from the SplitSkill, so that the user sees the portion of the original document that matched their search instead of the entire document?

Please assist me. Thank you in advance.

1 Answer

The split skill lets you split text into smaller chunks (pages) that can then be processed by additional cognitive skills.

Here is what a minimal skillset that does splitting and translation might look like:

"skills": [
    {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "textSplitMode": "pages",
        "maximumPageLength": 1000,
        "defaultLanguageCode": "en",
        "inputs": [
            {
                "name": "text",
                "source": "/document/content"
            },
            {
                "name": "languageCode",
                "source": "/document/language"
            }
        ],
        "outputs": [
            {
                "name": "textItems",
                "targetName": "mypages"
            }
        ]
    },
    {
        "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
        "name": "#2",
        "description": null,
        "context": "/document/mypages/*",
        "defaultFromLanguageCode": null,
        "defaultToLanguageCode": "es",
        "suggestedFrom": "en",
        "inputs": [
            {
                "name": "text",
                "source": "/document/mypages/*"
            }
        ],
        "outputs": [
            {
                "name": "translatedText",
                "targetName": "translated_text"
            }
        ]
    }
]

Note that the split skill generates a collection of text elements under the "/document/mypages" node in the enriched tree. Also note that by providing the context "/document/mypages/*" to the translation skill, we are telling it to run once per element of that collection, that is, to translate each page separately.
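To make the effect of the trailing "/*" more concrete, here is a toy model in plain Python. It is not the Azure SDK and not how the service is implemented; the `resolve` helper, the sample tree, and `fake_translate` are all made up for illustration. It only shows the difference between addressing the collection as one node and addressing each of its elements.

```python
# Toy model of an enriched document tree (hypothetical, not the Azure SDK).
def resolve(tree, path):
    """Return every node matched by a path such as '/document/mypages/*'."""
    nodes = [tree]
    for part in path.strip("/").split("/"):
        next_nodes = []
        for node in nodes:
            if part == "*":
                next_nodes.extend(node)        # fan out over each list element
            else:
                next_nodes.append(node[part])  # descend into a named child
        nodes = next_nodes
    return nodes


# Sample tree shaped like the output of the split skill above (made-up text).
enriched = {
    "document": {
        "content": "first page text second page text",
        "mypages": ["first page text", "second page text"],
    }
}


def fake_translate(text):
    """Stand-in for the translation skill; it only tags the text."""
    return f"[es] {text}"


whole_collection = resolve(enriched, "/document/mypages")  # one node: the list
pages = resolve(enriched, "/document/mypages/*")           # one node per page
translated = [fake_translate(page) for page in pages]      # runs once per page
```

Without the "/*", the path matches a single node (the whole list); with it, the path matches each page, which is why the translation skill produces one output per page.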

I should point out that documents will still be indexed at the document level, though, which is why your search returns the whole file rather than a single page. Skillsets are not really built to "change the cardinality of the index". That said, a workaround may be to project each of the pages as a separate element into a knowledge store, and then create a separate index that is focused on indexing each page.
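As a rough illustration of what "an index focused on each page" means in terms of data shape, the sketch below builds one search document per page in plain Python. It is not the knowledge store projection itself and makes no Azure calls. The field names (`id`, `parent_id`, `page_number`, `page_text`) and the fixed-width split are assumptions for the example, and the real split skill does not necessarily cut at the same positions.

```python
def split_into_pages(text, maximum_page_length):
    """Naive fixed-width split, used only as a stand-in for the split skill."""
    return [
        text[start:start + maximum_page_length]
        for start in range(0, len(text), maximum_page_length)
    ]


def build_page_documents(parent_id, text, maximum_page_length=1000):
    """Return one dict per page, each with its own key and a link to its parent."""
    return [
        {
            "id": f"{parent_id}_{number}",  # unique key per page (hypothetical scheme)
            "parent_id": parent_id,         # lets a hit be traced back to the file
            "page_number": number,
            "page_text": page,              # the only text a search hit would show
        }
        for number, page in enumerate(split_into_pages(text, maximum_page_length))
    ]


# Tiny example: a 10-character "document" split into pages of at most 4 characters.
docs = build_page_documents("report-docx", "abcdefghij", maximum_page_length=4)
```

With documents shaped like this in a page-level index, a query matches and returns individual pages, and `parent_id` identifies the original file.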

Learn more about the knowledge store projections here: https://docs.microsoft.com/en-us/azure/search/knowledge-store-concept-intro