3
votes

I have a scenario in which I am ingesting data from an MS SQL DB into Azure Data Lake using U-SQL. My table is quite big, with over 16 million records (soon it will be much more). I just do a SELECT a, b, c FROM dbo.myTable;
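The script is essentially along these lines (a minimal sketch; myDataSource and the column names are placeholders for my registered external data source and schema):

// Federated query against the external SQL DB data source
@rows =
    SELECT a, b, c
    FROM EXTERNAL myDataSource LOCATION "dbo.myTable";

// Write the result to the lake
OUTPUT @rows
TO "/output/myTable.csv"
USING Outputters.Csv();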

I realized, however, that only one vertex is used to read from the table.

[Screenshot: the ADLA job graph shows a single vertex handling the read from the table]

My question is, is there any way to leverage parallelism while reading from a SQL table?

Hi, have you considered Azure Data Factory? Is the MS SQL DB also in Azure? If so, IaaS or PaaS? – Alex KeySmith
I did consider that option. And yes, extracting CSV files in U-SQL can be parallelized by default. But I was hoping to use only U-SQL. Thanks for your input! – candidson

3 Answers

4
votes

I don't believe parallelism for external data sources is supported in U-SQL yet (although I'm happy to be corrected). If you feel this is an important missing feature, you can create a request and vote for it here:

https://feedback.azure.com/forums/327234-data-lake

As a workaround, you could manually parallelise your queries, depending on the columns available in your data source, e.g. by date:

// External query working
USE DATABASE yourADLADB;

// Create the external query for year 2016
@results2016 =
    SELECT *
    FROM EXTERNAL yourSQLDBDataSource EXECUTE 
        @"SELECT * FROM dbo.yourBigTable WITH (NOLOCK) WHERE yourDateCol Between '1 Jan 2016 and 31 Dec 2016'";


// Create the external query for year 2017
@results2017 =
    SELECT *
    FROM EXTERNAL yourSQLDBDataSource EXECUTE 
        @"SELECT * FROM dbo.yourBigTable WITH (NOLOCK) WHERE yourDateCol Between '1 Jan 2017 and 31 Dec 2017";


// Output 2016 results
OUTPUT @results2016
TO "/output/bigTable/results2016.csv"
USING Outputters.Csv();


// Output 2017 results
OUTPUT @results2017
TO "/output/bigTable/results2017.csv"
USING Outputters.Csv();

Now, this has created a different issue by breaking the data up into multiple files. However, you could then read these back using filesets, which will also parallelise, e.g.:

@input =
    EXTRACT
            ..., // your column list
            year int // virtual column populated from the {year} token in the path
    FROM "/output/bigTable/results{year}.csv"
    USING Extractors.Csv();
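For completeness, the yourSQLDBDataSource referenced in the external queries above needs to exist in your ADLA database before the job runs. A minimal sketch of the registration, assuming an Azure SQL Database source and a catalog credential named yourCredential created beforehand (e.g. via the New-AzureRmDataLakeAnalyticsCatalogCredential PowerShell cmdlet):

USE DATABASE yourADLADB;

// Register the external Azure SQL Database; the credential must already exist
CREATE DATA SOURCE IF NOT EXISTS yourSQLDBDataSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=yourSQLDB;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = yourCredential,
    // Types which can be remoted to the source for predicate push-down
    REMOTABLE_TYPES = (bool, byte, short, int, long, decimal, float, double, string, DateTime)
);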

I would ask why you are choosing to move such a large file into your lake, given ADLA and U-SQL offer you the ability to query data where it lives. Can you explain further?

1
votes

Queries to external data sources are not automatically parallelized in U-SQL. (This is something we are considering for the future.)

wBob's answer does give one option for achieving much the same effect, though it of course requires you to manually partition and query the data using multiple U-SQL statements.

Please note that doing parallel reads in a non-transacted environment can lead to duplicate or missed data if parallel writes occur at the source. So some care needs to be taken, and users will need to know the trade-offs.

0
votes

Another potential solution here would be to create an HDInsight cluster backed by the same ADLS store as your ADLA account.

You can then use Apache Sqoop to copy the data in parallel from SQL Server to a directory in ADLS, and then import that data (which will be split across multiple files) into tables using U-SQL.
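A minimal sketch of such a Sqoop import, assuming an HDInsight cluster with ADLS attached as storage (the server, database, table, and column names are placeholders; --split-by tells Sqoop which column to partition the parallel reads on, and -m sets the number of parallel mappers):

sqoop import \
    --connect "jdbc:sqlserver://yourserver.database.windows.net:1433;database=yourSQLDB" \
    --username youruser --password yourpassword \
    --table yourBigTable \
    --split-by yourDateCol \
    -m 8 \
    --target-dir adl://youradlsaccount.azuredatalakestore.net/sqoop/bigTable

The part files written under --target-dir can then be read with a fileset EXTRACT, much like the one in wBob's answer.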