3
votes

I have a scenario in which I am ingesting data from an MS SQL DB into Azure Data Lake using U-SQL. My table is quite big, with over 16 million records (soon it will be much more). I just do a SELECT a, b, c FROM dbo.myTable;
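The script is essentially along these lines (a minimal sketch; myDataSource and the column names are placeholders for my registered external data source and schema):

// Federated query against the external SQL DB data source
@rows =
    SELECT a, b, c
    FROM EXTERNAL myDataSource LOCATION "dbo.myTable";

// Write the result to the lake
OUTPUT @rows
TO "/output/myTable.csv"
USING Outputters.Csv();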

I realized, however, that only one vertex is used to read from the table.

[Screenshot: the ADLA job graph shows a single vertex handling the read from the table]

My question is, is there any way to leverage parallelism while reading from a SQL table?

Hi, have you considered Azure Data Factory? Is the MS SQL DB also in Azure? If so, IaaS or PaaS? – Alex KeySmith
I did consider that option. And yes, extracting CSV files in U-SQL can be parallelized by default. But I was hoping to use only U-SQL. Thanks for your input! – candidson

3 Answers

4
votes

I don't believe parallelism for external data sources is supported in U-SQL yet (although I'm happy to be corrected). If you feel this is an important missing feature, you can create a request and vote for it here:

https://feedback.azure.com/forums/327234-data-lake

As a workaround, you could manually parallelise your queries, depending on the columns available in your data source, e.g. by date:

// External query working
USE DATABASE yourADLADB;

// Create the external query for year 2016
@results2016 =
    SELECT *
    FROM EXTERNAL yourSQLDBDataSource EXECUTE 
        @"SELECT * FROM dbo.yourBigTable WITH (NOLOCK) WHERE yourDateCol Between '1 Jan 2016 and 31 Dec 2016'";


// Create the external query for year 2017
@results2017 =
    SELECT *
    FROM EXTERNAL yourSQLDBDataSource EXECUTE 
        @"SELECT * FROM dbo.yourBigTable WITH (NOLOCK) WHERE yourDateCol Between '1 Jan 2017 and 31 Dec 2017";


// Output 2016 results
OUTPUT @results2016
TO "/output/bigTable/results2016.csv"
USING Outputters.Csv();


// Output 2017 results
OUTPUT @results2017
TO "/output/bigTable/results2017.csv"
USING Outputters.Csv();

Now, this has created a different issue by breaking the data up into multiple files. However, you could then read these back using filesets, which will also parallelise, e.g.:

@input =
    EXTRACT
            ..., // your column list
            year int // virtual column populated from the {year} token in the path
    FROM "/output/bigTable/results{year}.csv"
    USING Extractors.Csv();
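For completeness, the yourSQLDBDataSource referenced in the external queries above needs to exist in your ADLA database before the job runs. A minimal sketch of the registration, assuming an Azure SQL Database source and a catalog credential named yourCredential created beforehand (e.g. via the New-AzureRmDataLakeAnalyticsCatalogCredential PowerShell cmdlet):

USE DATABASE yourADLADB;

// Register the external Azure SQL Database; the credential must already exist
CREATE DATA SOURCE IF NOT EXISTS yourSQLDBDataSource
FROM AZURESQLDB
WITH
(
    PROVIDER_STRING = "Database=yourSQLDB;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = yourCredential,
    // Types which can be remoted to the source for predicate push-down
    REMOTABLE_TYPES = (bool, byte, short, int, long, decimal, float, double, string, DateTime)
);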

I would ask why you are choosing to move such a large file into your lake, given ADLA and U-SQL offer you the ability to query data where it lives. Can you explain further?

1
votes

Queries to external data sources are not automatically parallelized in U-SQL. (This is something we are considering for the future.)

wBob's answer does give one option for achieving much the same effect, though it of course requires you to manually partition and query the data using multiple U-SQL statements.

Please note that doing parallel reads in a non-transacted environment can lead to duplicate or missed data if parallel writes occur at the source. So some care needs to be taken, and users will need to know the trade-offs.

0
votes

Another potential solution here would be to create an HDInsight cluster backed by the same ADLS store as your ADLA account.

You can then use Apache Sqoop to copy the data in parallel from SQL Server to a directory in ADLS, and then import that data (which will be split across multiple files) into tables using U-SQL.
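A minimal sketch of such a Sqoop import, assuming an HDInsight cluster with ADLS attached as storage (the server, database, table, and column names are placeholders; --split-by tells Sqoop which column to partition the parallel reads on, and -m sets the number of parallel mappers):

sqoop import \
    --connect "jdbc:sqlserver://yourserver.database.windows.net:1433;database=yourSQLDB" \
    --username youruser --password yourpassword \
    --table yourBigTable \
    --split-by yourDateCol \
    -m 8 \
    --target-dir adl://youradlsaccount.azuredatalakestore.net/sqoop/bigTable

The part files written under --target-dir can then be read with a fileset EXTRACT, much like the one in wBob's answer.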