1 vote

All -

I am having some issues using ADLA to join tables and load them into Azure SQL DW.

I am stuck on one of the loading processes and not sure how to fix the problem. I have some purchase information which, when pulled from the on-prem system into ADLS, creates output files of around 25 MB each. When I try to join them using an INNER JOIN in an ADLA job, it keeps running for a long time, and while it runs I can see the output file grow to more than 5 GB. That seems far too much for 25 MB input files, and I am not sure if I am doing anything wrong (screen capture attached for reference, taken while running from VS).

The process I follow is:

1. Load data from the on-prem DB into the ADLS store using ADF.
2. Run an ADLA job to pull the ADLS file into an ADLA table, partitioned using the ROUND ROBIN option to avoid data-skew issues.
3. The ADLA job creates an aggregated file in the ADLS store.
4. An ADF pipeline loads the file from step 3 into Azure SQL DW.

Any pointers or help would be much appreciated.

Thanks, Shabbir

Here is the code (I have removed all field names and client information). Hope this helps.

DECLARE @dir string = "/{date:yyyy}/{date:MM}/{date:dd}/<FILE PATH>/";
DECLARE @in_1 string = <FILE #1>;
DECLARE @in_2 string = <FILE #2>;
DECLARE @out string = <OUTPUT FILE>;

@file1 =
    EXTRACT 
        <25 COL. WITH ALL NULLABLE>
        date DateTime
    FROM @in_1
    USING Extractors.Text(delimiter : '|');

@file2 =
    EXTRACT 
        <40 COL. WITH ALL NULLABLE>
        date DateTime
    FROM @in_2
    USING Extractors.Text(delimiter : '|');

@output =
    SELECT 
        <25 + 40 COL. FROM INPUT FILES>
    FROM @file1 AS f1
        INNER JOIN @file2 AS f2
        ON f1.join1 == f2.join2;

OUTPUT @purchase
TO @out
USING Outputters.Text(delimiter : '|');

This is run through an ADF pipeline using external parameters for the date and file paths.


Please post the U-SQL code you are using. - wBob
Bob, I have added the code snippet to my question. Thanks in advance. - Shabbir Mala
If your join is producing an increase in the data volume, a probable cause is that you have duplicated values in your join columns. If a value is present 100 times in the left table and 1000 times in the right table, that will lead to 100,000 lines in the output (cartesian product). Could you check that, for example by loading your data in Power BI? - Alexandre Gattiker
Yes, Alexandre, this looks like a cartesian-product issue. I am good for now. Thanks! - Shabbir Mala

2 Answers

0 votes

It looks like you are not using ADLA tables, but just dealing with files directly in U-SQL. That is fine, but I just wanted to check, since you mentioned tables. Also, the code outputs @purchase, which doesn't exist in the script - I assume you are actually outputting the right rowset in your real script.

A few questions:

  • Have you checked the format/content of the input files in ADLS? Have you done an output of @file1 and @file2 to check they create what you expect?
  • How do you know that the output file is growing to 5 GB? From the job graph? Can you attach that?
  • Are you using the "date" column for something?
  • Is your join column a single column, and what type is it?
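One quick way to answer the first and last questions above is to profile the join-key column of each extracted file and look for heavily duplicated values. A minimal Python sketch (the file path, delimiter, and key column index here are placeholders, not details from the original post):

```python
import csv
from collections import Counter

def top_key_counts(path, key_index, delimiter="|", top=5):
    """Return the most frequent values of one column in a delimited file.

    Heavily repeated values in the join column are the usual cause of
    an inner join blowing up far beyond the input size.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) > key_index:
                counts[row[key_index]] += 1
    return counts.most_common(top)
```

Run it against both input files with the index of your join column; if the top counts are large on both sides, the join is multiplying rows.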
0
votes

If f1.join1 == f2.join2 is actually comparing a (more-or-less) unique key versus another (more-or-less) unique key, then you should get a reasonable number of rows.

But many people implement CROSS JOINs using this same syntax. For example, f1.join1 could be 1 throughout table f1, and f2.join2 could also be 1 throughout table f2. These data values would cause your output to be a CROSS JOIN of the two tables.
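You can estimate how many rows such a join will produce before running it: for each key value that appears on both sides, multiply its occurrence counts and sum the products. A small Python sketch of that arithmetic (an illustrative helper, not part of U-SQL or the original script):

```python
from collections import Counter

def estimate_join_rows(left_keys, right_keys):
    """Estimate inner-join output size from the two join-key columns.

    For each key present on both sides, the join emits
    count_left(key) * count_right(key) rows.
    """
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(left[k] * right[k] for k in left.keys() & right.keys())

# A key repeated 100 times on the left and 1000 times on the right
# alone contributes 100,000 output rows.
print(estimate_join_rows([1] * 100, [1] * 1000))  # -> 100000
```

If the estimate dwarfs the input row counts, deduplicate the join keys or add more columns to the join condition before loading into Azure SQL DW.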