0
votes

I have a bunch of U-SQL activities that manipulates & transform data in an Azure Data Lake. Out of this, I get a csv file that contains all my events.

Next I would just use a Copy Data activity to copy the csv file from the Data Lake directly into an Azure SQL Data Warehouse table.

Extract-Transform-Merge-Load

  1. I extract the information from a bunch of JSON files stored in the Data Lake and create a staging .csv file;
  2. I grab the staging .csv file & a production .csv file and inject the latest change (and avoid duplicates) and save the production .csv file;
  3. Copy the .csv production file directly to the Warehouse table.

I realized that my table contains duplicated rows and, after having tested the U-SQL scripts, I assume that the Copy Data activity -somehow- merges the content of the csv file into the table.

Question

I am not convinced I am doing the right thing here. Should I define my warehouse table as an EXTERNAL table that would get its data from the .csv production file? Or should I change my U-SQL to only include the latest changes?

1

1 Answers

0
votes

If you want to use external tables depends on your use case. If you want the data to be stored inside SQL DW for better performance, you have to copy it at some point, e.g. via a stored procedure. You could then just call the stored procedure from ADF, for instance.

Or, if you don't want to / cannot filter out data beforehand, you could also implement an "Upsert" stored procedure in your SQL DW and call this to insert your data instead of the copy activity.