I am using Azure Data Factory to copy data from a REST API to Azure Data Lake Store. Following is the JSON of my Copy activity:

{
    "name": "CopyDataFromGraphAPI",
    "type": "Copy",
    "policy": {
        "timeout": "7.00:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false
    },
    "typeProperties": {
        "source": {
            "type": "HttpSource",
            "httpRequestTimeout": "00:30:40"
        },
        "sink": {
            "type": "AzureDataLakeStoreSink"
        },
        "enableStaging": false,
        "cloudDataMovementUnits": 0,
        "translator": {
            "type": "TabularTranslator",
            "columnMappings": "id: id, name: name, email: email, administrator: administrator"
        }
    },
    "inputs": [
        {
            "referenceName": "MembersHttpFile",
            "type": "DatasetReference"
        }
    ],
    "outputs": [
        {
            "referenceName": "MembersDataLakeSink",
            "type": "DatasetReference"
        }
    ]
}

The REST API was created by me. For testing purposes it initially returned just 2500 rows, and my pipeline worked fine: it copied the data from the REST API call to Azure Data Lake Store.

After testing, I updated the REST API so that it now returns 125000 rows. I tested the API in a REST client and it works fine, but in Azure Data Factory's Copy activity it gives the following error while copying data to Azure Data Lake Store:

{
    "errorCode": "2200",
    "message": "Failure happened on 'Sink' side. ErrorCode=UserErrorFailedToReadHttpFile,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to read data from http source file.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Net.WebException,Message=The remote server returned an error: (500) Internal Server Error.,Source=System,'",
    "failureType": "UserError",
    "target": "CopyDataFromGraphAPI"
}

The sink side is Azure Data Lake Store. Is there any limit on the size of the content I can copy from the REST call to Azure Data Lake Store?

I also retested the pipeline with the REST API returning 2500 rows and it worked fine; as soon as I switched the API call back to returning 125000 rows, the pipeline started giving the same error as above.

My source dataset in the Copy activity is:

{
    "name": "MembersHttpFile",
    "properties": {
        "linkedServiceName": {
            "referenceName": "WM_GBS_LinikedService",
            "type": "LinkedServiceReference"
        },
        "type": "HttpFile",
        "structure": [
            {
                "name": "id",
                "type": "String"
            },
            {
                "name": "name",
                "type": "String"
            },
            {
                "name": "email",
                "type": "String"
            },
            {
                "name": "administrator",
                "type": "Boolean"
            }
        ],
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "arrayOfObjects",
                "jsonPathDefinition": {
                    "id": "$.['id']",
                    "name": "$.['name']",
                    "email": "$.['email']",
                    "administrator": "$.['administrator']"
                }
            },
            "relativeUrl": "api/workplace/members",
            "requestMethod": "Get"
        }
    }
}

The sink dataset is:

{
    "name": "MembersDataLakeSink",
    "properties": {
        "linkedServiceName": {
            "referenceName": "DataLakeLinkService",
            "type": "LinkedServiceReference"
        },
        "type": "AzureDataLakeStoreFile",
        "structure": [
            {
                "name": "id",
                "type": "String"
            },
            {
                "name": "name",
                "type": "String"
            },
            {
                "name": "email",
                "type": "String"
            },
            {
                "name": "administrator",
                "type": "Boolean"
            }
        ],
        "typeProperties": {
            "format": {
                "type": "JsonFormat",
                "filePattern": "arrayOfObjects",
                "jsonPathDefinition": {
                    "id": "$.['id']",
                    "name": "$.['name']",
                    "email": "$.['email']",
                    "administrator": "$.['administrator']"
                }
            },
            "fileName": "WorkplaceMembers.json",
            "folderPath": "rawSources"
        }
    }
}

1 Answer

As far as I know, there is no limit on file size. I've loaded a 10 GB CSV with millions of rows and Data Lake doesn't care.

What I can see is that although the error says it happened on the "sink" side, the error code is UserErrorFailedToReadHttpFile, so I think the issue may be solved by increasing httpRequestTimeout on your source. It is currently "00:30:40", and the row transfer may be getting cut off because of it. Thirty minutes is plenty of time for 2500 rows, but 125000 may not fit within it.
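As a rough sketch, the source block with a longer timeout would look like the following. The "02:00:00" value is only an illustrative assumption; pick whatever fits how long your API actually takes to return 125000 rows.

"source": {
    "type": "HttpSource",
    "httpRequestTimeout": "02:00:00"
}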

Hope this helped!