I have a bunch of U-SQL activities that manipulate and transform data in an Azure Data Lake. Out of this, I get a .csv file that contains all my events. Next, I just use a Copy Data activity to copy the .csv file from the Data Lake directly into an Azure SQL Data Warehouse table.
- I extract the information from a bunch of JSON files stored in the Data Lake and create a staging.csv file;
- I take the staging.csv file and a production.csv file, inject the latest changes while avoiding duplicates, and save the updated production.csv file (a simplified sketch of this step follows the list);
- I copy the production.csv file directly into the Warehouse table.
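The merge step looks roughly like this; the Events schema (EventId, EventDate, Payload) and the file paths are simplified placeholders rather than my real ones:

```usql
// Read the freshly extracted events (placeholder schema and paths).
@staging =
    EXTRACT EventId int,
            EventDate DateTime,
            Payload string
    FROM "/events/staging.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Read the current production file.
@production =
    EXTRACT EventId int,
            EventDate DateTime,
            Payload string
    FROM "/events/production.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Stack both files, then keep a single copy of each exact row.
@all =
    SELECT * FROM @production
    UNION ALL
    SELECT * FROM @staging;

@merged =
    SELECT DISTINCT EventId, EventDate, Payload
    FROM @all;

// Write to a new file; the pipeline then replaces production.csv with it.
OUTPUT @merged
TO "/events/production.merged.csv"
USING Outputters.Csv(outputHeader: true);
```

Note that DISTINCT only removes rows that are identical in every column; deduplicating on a business key would need something like ROW_NUMBER instead.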
I realized that my table contains duplicate rows and, after testing the U-SQL scripts, I assume that the Copy Data activity somehow merges the content of the .csv file into the table.
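For context, a check like the following is what exposes the duplicates (table and column names are placeholders):

```sql
-- Count how many times each event appears in the warehouse table.
-- dbo.ProductionEvents and EventId are hypothetical names.
SELECT EventId, COUNT(*) AS Copies
FROM dbo.ProductionEvents
GROUP BY EventId
HAVING COUNT(*) > 1;
```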
Question
I am not convinced I am doing the right thing here. Should I define my warehouse table as an EXTERNAL table that would get its data from the production.csv file? Or should I change my U-SQL to only include the latest changes?
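For reference, the EXTERNAL table option I have in mind would look roughly like this; every name below is hypothetical, and it assumes a database-scoped credential for the Data Lake already exists and that production.csv is written without a header row:

```sql
-- Point PolyBase at the Data Lake account; AdlCredential is a
-- pre-existing database-scoped credential (placeholder name).
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://mydatalake.azuredatalakestore.net',
    CREDENTIAL = AdlCredential
);

-- Plain comma-delimited format; assumes the file has no header row.
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
);

-- The table reads production.csv in place instead of copying it.
CREATE EXTERNAL TABLE dbo.ProductionEvents
(
    EventId   INT,
    EventDate DATETIME2,
    Payload   NVARCHAR(4000)
)
WITH (
    LOCATION = '/events/production.csv',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = CsvFileFormat
);
```

With that setup the Warehouse would always reflect the current content of production.csv, so duplicates could only come from the file itself; the trade-off is that every query reads the file through PolyBase instead of warehouse storage.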