All -
I have some issues using ADLA joining tables and loading them into Azure SQL DW.
I am stuck on one of the loading process and not sure how can fix that problem. I have some purchase information which when I pull from on-prem system into ADLS creates output file of around 25 MB each and when I try to join them to using "INNER JOIN" in ADLA job it keeps running for a long time. While running I see the output file keeps going on to more then 5+ GB. I feel that's too much for a 25 MB file and not sure if am doing anything wrong (screen capture attached for reference while running from VS).
The process I do is - 1- load data from on-prem db to cloud ADLS store using ADF 2- run ADLA job to pull ADLS file into ADLA table and partition using ROUND ROBIN option to avoid data skew issues 3- ADLA job would create an aggregated file in ADLS store 4- ADF pipeline to load #3 file into Azure SQL DW
Any pointers or help could be very helpful.
Thanks, Shabbir
Here is the code (I have removed all field names and client information). Hope this helps.
DECLARE @dir = "/{date:yyyy}/{date:MM}/{date:dd}/<FILE PATH>/";
DECLARE @in_1 string = <FILE #1>;
DECLARE @in_2 string = <FILE #2>;
DECLARE @out string = <OUTPUT FILE>;
@file1 =
EXTRACT
<25 COL. WITH ALL NULLABLE>
date DateTime
FROM @in_1
USING Extractors.Text(delimiter : '|');
@file2 =
EXTRACT
<40 COL. WITH ALL NULLABLE>
date DateTime
FROM @in_2
USING Extractors.Text(delimiter : '|');
@output =
SELECT
<25 + 40 COL. FROM INPUT FILES>
FROM @file1 AS f1
INNER JOIN @file2 AS f2
ON f1.join1 == f2.join2;
OUTPUT @purchase
TO @out
USING Outputters.Text(delimiter : '|');
This is run throw a ADF pipeline and using external parameters for date and file paths.