
What is the difference in performance between the two cases below, and which one is more cost-optimised in terms of both execution and storage?

  1. Create an external table over the files in S3, then load into the Snowflake target table from the external table using "INSERT INTO.... SELECT...".
  2. Load the data from S3 into a temporary table, then load into the Snowflake target table from the temporary table using "INSERT INTO.... SELECT...".
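The two options can be sketched in Snowflake SQL as follows (the stage, table, and column names here are hypothetical, and the CSV file format is an assumption):

```sql
-- Option 1: external table over S3 files, then INSERT ... SELECT
CREATE OR REPLACE EXTERNAL TABLE ext_src (
  id   INT    AS (VALUE:c1::INT),      -- CSV columns are exposed via VALUE:c1, c2, ...
  name STRING AS (VALUE:c2::STRING)
)
LOCATION = @my_s3_stage
FILE_FORMAT = (TYPE = CSV);

INSERT INTO target_tbl (id, name)
SELECT id, name FROM ext_src;

-- Option 2: COPY into a temporary table, then INSERT ... SELECT
CREATE TEMPORARY TABLE tmp_src (id INT, name STRING);

COPY INTO tmp_src
FROM @my_s3_stage
FILE_FORMAT = (TYPE = CSV);

INSERT INTO target_tbl (id, name)
SELECT id, name FROM tmp_src;
```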
Hi - can you clarify what you mean by a temporary table? Where is this temporary table, and how are you loading data into it? When you say "load(INSERT)", do you mean an "INSERT INTO.... SELECT..." statement or a "COPY INTO..." statement? - NickW
Using the COPY command to load data into a TEMP table in Snowflake, then using an "INSERT INTO.... SELECT..." command for the data loading. Please let me know if any further clarification is required. - rakesh singh
COPY is much faster than INSERT in Snowflake. However, I would not use either of the two options you have listed. Why not COPY the S3 files directly into the target table? Is there a specific reason why you are thinking about creating external/temp tables? - NickW
Yes, there is a specific reason: I want to push down logic (joins, filters, aggregations, functions) across different sources (S3, Azure Gen2, Google storage). Could you please share some insight into how an EXTERNAL table works? Under the hood, is it using any temp storage? Because "INSERT INTO.... SELECT..." works for external tables. - rakesh singh

1 Answer


OK - if you want to use the power of the Snowflake platform as much as possible (pushdown optimisation), then you need to get your data into Snowflake as efficiently as possible first, and then run your SQL queries (joins, filters, aggregations, etc.) against it. Use COPY to move your S3/Azure/Google files into Snowflake tables and then run INSERT... SELECT against these.
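As a sketch of that approach (the stage names, table names, and columns below are made up for illustration):

```sql
-- 1. COPY each source's files straight into a native Snowflake staging table
COPY INTO stg_orders
FROM @my_s3_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- 2. Do the joins/filters/aggregations inside Snowflake, against native tables
INSERT INTO order_summary (customer_id, total_amount)
SELECT o.customer_id, SUM(o.amount)
FROM stg_orders o
JOIN stg_customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2023-01-01'
GROUP BY o.customer_id;
```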

There is no reason to create EXTERNAL tables and, if you do, they will perform much worse than the approach I have proposed.

External Tables - short explanation

For the sake of simplicity, let's assume that your Snowflake instance is running on AWS and you also have some files in an S3 bucket.

All your Snowflake data is stored in S3 by Snowflake, but in a heavily compressed and optimised format. Snowflake holds metadata about where your data is and what it contains, which allows it to present your data as tables/columns.

An external table is essentially the same thing: Snowflake holds metadata about the files in your S3 bucket that allows it to present the data as tables/columns. The differences are that:

  • The external data is not compressed/optimised, so it takes up more storage and is slower to query
  • Effectively, the query engine for your external table is the S3 environment (unless/until Snowflake has read the data into memory/cache where it can then process it as though it was Snowflake data)
  • There is probably some element of network latency - depending on where your Snowflake account and S3 buckets are located in the global AWS infrastructure
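For reference, a minimal external table definition looks roughly like this (the stage, bucket, table, and field names are hypothetical, and Parquet is assumed as the file format):

```sql
-- Stage pointing at the S3 location (credentials/storage integration omitted)
CREATE STAGE my_ext_stage
  URL = 's3://my-bucket/data/';

CREATE OR REPLACE EXTERNAL TABLE ext_events
  LOCATION = @my_ext_stage
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

-- Each row is exposed as a single VARIANT column named VALUE;
-- every query re-reads the raw files from S3, not Snowflake's optimised storage
SELECT VALUE:event_id::STRING, VALUE:ts::TIMESTAMP
FROM ext_events;

-- With AUTO_REFRESH off, the file metadata must be refreshed manually
ALTER EXTERNAL TABLE ext_events REFRESH;
```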

Hope this helps?