7 votes

I have a data factory with a pipeline copy activity like this:

{
  "type": "Copy",
  "name": "Copy from storage to SQL",
  "inputs": [
    {
      "name": "storageDatasetName"
    }
  ],
  "outputs": [
    {
      "name": "sqlOutputDatasetName"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "BlobSource"
    },
    "sink": {
      "type": "SqlSink"
    }
  },
  "policy": {
    "concurrency": 1,
    "retry": 3
  },
  "scheduler": {
    "frequency": "Month",
    "interval": 1
  }
}

The input data is approximately 90 MB in size (about 1.5 million rows), broken into roughly 20 block blob files of ~4.5 MB each in Azure Storage. Here's an example of the data (CSV):

A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895

The sink is an Azure SQL Database at the S2 tier, which is rated at 50 DTUs. I've created a simple table with sensible data types and no keys, indexes or anything fancy, just columns:

CREATE TABLE [dbo].[Prescriptions](
    [Practice] [char](6) NOT NULL,
    [BnfChapter] [tinyint] NOT NULL,
    [BnfSection] [tinyint] NOT NULL,
    [BnfParagraph] [tinyint] NOT NULL,
    [TotalItems] [int] NOT NULL,
    [TotalQty] [int] NOT NULL,
    [TotalActCost] [float] NOT NULL,
    [TotalItemsPerThousand] [float] NOT NULL,
    [TotalQtyPerThousand] [float] NOT NULL,
    [TotalActCostPerThousand] [float] NOT NULL
)

The source, sink and data factory are all in the same region (North Europe).

According to Microsoft's 'Copy activity performance and tuning guide', for an Azure Blob storage source and an Azure SQL S2 sink, I should be getting about 0.4 MBps. By my calculation, that means 90 MB should transfer in about half an hour (is that right?).


For some reason it copies 70,000 rows very quickly, then seems to hang. Using SQL Server Management Studio I can see that the row count in the database table is exactly 70,000 and hasn't increased at all in 7 hours, yet the copy task is still running with no errors:

[Screenshot: the copy activity still shown as in progress]

Any ideas why this is hanging at 70,000 rows? I can't see anything unusual about the 70,001st data row that would cause a problem. I've tried completely trashing the data factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows), which completes in about 1 minute.


2 Answers

12 votes

Just to answer my own question in case it helps anyone else:

The issue was with null values. The reason my run was hanging at 70,000 rows was that at row 76,560 of my blob source file there was a null value in one of the columns. The Hive script I had used to generate this blob file had written the null value as '\N'. On top of that, my sink SQL table declared that column as NOT NULL, and the column was a FLOAT.

So I made two changes. First, I added the following property to my blob dataset definition:

"nullValue": "\\N"

Second, I made the SQL table column nullable. It now runs to completion and doesn't hang! :)
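
For context, this is roughly where that property sits in the (v1) blob dataset definition, under the TextFormat settings. The linked service name, folder path and availability below are placeholders rather than my actual values:

{
  "name": "storageDatasetName",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "<yourStorageLinkedService>",
    "typeProperties": {
      "folderPath": "<container>/<folder>/",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "nullValue": "\\N"
      }
    },
    "availability": {
      "frequency": "Month",
      "interval": 1
    }
  }
}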

The frustrating part is that Data Factory didn't fail, it just got stuck. It would have been nice if the job had failed with a useful error message telling me which row of the data was the problem. I think it stopped at 70,000 rather than 76,560 because the default write batch size is 10,000: row 76,560 falls in the eighth batch, so only the first seven batches (70,000 rows) were committed before the copy stalled.

0 votes

Here is another workaround: explicitly set the write batch size on the SQL sink rather than relying on the default value (10,000).

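A minimal sketch of what that might look like in the copy activity's typeProperties (the batch size of 100000 and the timeout are just illustrative values, not ones from this answer):

"typeProperties": {
  "source": {
    "type": "BlobSource"
  },
  "sink": {
    "type": "SqlSink",
    "writeBatchSize": 100000,
    "writeBatchTimeout": "00:30:00"
  }
}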