0
votes

I am trying to build an Azure Data Factory pipeline that grabs data from an Azure Data Lake and transforms it.

I want to store the transformed data in an Azure SQL Data Warehouse (as facts and dimensions).

I ended up with two solutions to make the data available in the warehouse:

  1. Use CREATE EXTERNAL TABLE and reference the transformed data that sit in the Lake in .csv files;
  2. Use CTAS to copy the .csv files from the Lake into the Warehouse;

Question

It is not obvious to me what the tradeoffs in play. On one hand, it seems to me that the 1st option avoids copies to the warehouse (is it though?). But the 2nd option would offer more options such as indexing.

What are the pros & cons of CREATE EXTERNAL TABLE compared to CTAS?

1

1 Answers

3
votes

The main benefit of landing the data in the DW (using CTAS) is that subsequent queries against that data will be significantly faster. You'll get full distribution of the data across your nodes and distributions. Any queries that filter or join the table will be substantially faster.

The downside, as you point out, is that you're now storing another copy of the data ... assuming that you don't delete the file from your data lake once ingested.

Think of Create External Table as simply providing a mapping of the external file to a SQL data structure.