0
votes

During developing pipeline which will use Elasticsearch as a source I faced with issue related paging. I am using SQL Elasticsearch API. Basically, I've started to do request in postman and it works well. The body of request looks following:

{
    "query":"SELECT Id,name,ownership,modifiedDate FROM \"core\"  ORDER BY Id",
    "fetch_size": 20,
    "cursor" : ""
}

After first run in response body it contains cursor string which is pointer to next page. If in postman I send the request and provide cursor value from previous request it return data for second page and so on. I am trying to archive the same result in Azure Data Factory. For this I using copy activity, which store response to Azure blob. Setup for source is following. copy activity source configuration

This is expression for body

    {
    "query": "SELECT Id,name,ownership,modifiedDate FROM \"@{variables('TableName')}\" WHERE ORDER BY Id","fetch_size": @{variables('Rows')}, "cursor": ""
    }

I have no idea how to correctly setup pagination rule. The pipeline works properly but only for the first request. I've tried to setup Headers.cursor and expression $.cursor but this setup leads to an infinite loop and pipeline fails with the Elasticsearch restriction.

I've also tried to read document at https://docs.microsoft.com/en-us/azure/data-factory/connector-rest#pagination-support but it seems pretty limited in terms of usage examples and difficult for understanding.

Could somebody help me understand how to build the pipeline with paging abilities utilization? enter image description here

Responce with the cursor looks like:

{
    "columns": [
        {
            "name": "companyId",
            "type": "integer"
        },
        {
            "name": "name",
            "type": "text"
        },
        {
            "name": "ownership",
            "type": "keyword"
        },
        {
            "name": "modifiedDate",
            "type": "datetime"
        }
    ],
    "rows": [
        [
            2,
            "mic Inc.",
            "manufacture",
            "2021-03-31T12:57:51.000Z"
        ]
    ],
    "cursor": "g/WuAwFaAXNoRG5GMVpYSjVWR2hsYmtabGRHTm9BZ0FBQUFBRUp6VGxGbUpIZWxWaVMzcGhVWEJITUhkbmJsRlhlUzFtWjNjQUFBQUFCQ2MwNWhaaVIzcFZZa3Q2WVZGd1J6QjNaMjVSVjNrdFptZDP/////DwQBZgljb21wYW55SWQBCWNvbXBhbnlJZAEHaW50ZWdlcgAAAAFmBG5hbWUBBG5hbWUBBHRleHQAAAABZglvd25lcnNoaXABCW93bmVyc2hpcAEHa2V5d29yZAEAAAFmDG1vZGlmaWVkRGF0ZQEMbW9kaWZpZWREYXRlAQhkYXRldGltZQEAAAEP"
}
1
Does the response returned by the server contain the absolute or relative address of the next page? The official Microsoft tutorial is to directly send the absolute address of the next page request. - Joseph Xu
No, unfortunately, response only contain cursor. If I pass this cursor in the body of request, the response will contain data from next page. - Eduard K.
Does the cursor contain information on the next page? Can you show some snippets the cursor? - Joseph Xu
Of cource. Sorry. I will add it to question - Eduard K.

1 Answers

0
votes

I finally find the solution, hopefully, it will be useful for the community. Basically, what needs to be done it is split the solution into four steps.

Step 1 Make the first request as in the question description and stage file to blob. Step 2 Read blob file and get the cursor value, set it to variable Step 3 Keep requesting data with a changed body

{"cursor" : "@{variables('cursor')}" }

Pipeline looks like this: pipeline

Configuration of pagination looks following pagination . It is a workaround as the server ignores this header, but we need to have something which allows sending a request in loop.