
I am looking for the best way to automatically detect new files in an S3 bucket and then load the data into a Snowflake table.

I know this can be achieved using Snowpipe with SNS/SQS notifications set up in AWS, but I would like a self-contained solution within Snowflake that can be used for multiple data sources.

I want to have a table that is updated with the file names from an S3 bucket, and then to load only the files that have not already been loaded from S3 into Snowflake.

The only way I have found to automatically detect new files in an external S3 stage from Snowflake so far is to use the code below together with a task on a set schedule. This lists the file names and then uses RESULT_SCAN to return the output of the last query as a table.

-- List the stage files, capture that query's ID, then read the listing back as a table
list @STAGE_NAME;
set qid = last_query_id();
select "name" from table(result_scan($qid));

Does anyone know a better way to automatically detect new files in an external stage from Snowflake? Any help is much appreciated.

As far as I know, it's the only way. – Gokhan Atil
Snowpipe, and tasks along with RESULT_SCAN, are the only ways! – Abhi Reddy
Thanks! One concern I have with this approach is how to ensure the last query run is always the LIST statement. Would putting this in a procedure ensure it always captures the correct ID? Have you run into any difficulties when combining this with tasks and procedures? – martins

2 Answers

0 votes

Not necessarily better than the way you've already found, but there is an alternative approach to listing the files in an S3 bucket.

If you create an EXTERNAL TABLE over the data in S3, you can then use the METADATA$FILENAME pseudo-column in a query. If you keep a record of which files have already been loaded into Snowflake, you can compare against it and select the names of the new files to process.

e.g.

-- Sync the external table's metadata with the current contents of the bucket
ALTER EXTERNAL TABLE MYSCHEMA.MYEXTERNALTABLE REFRESH;

-- List every file the external table can currently see
SELECT DISTINCT
   METADATA$FILENAME as filename
FROM
   MYSCHEMA.MYEXTERNALTABLE;
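
If you track loaded files in a table of your own, picking out the new files is then a simple set difference. A minimal sketch, assuming a hypothetical MYSCHEMA.LOADED_FILES tracking table with a FILENAME column (both names are placeholders, not part of your setup):

-- Files visible in S3 that have not yet been recorded as loaded
SELECT METADATA$FILENAME AS filename
FROM MYSCHEMA.MYEXTERNALTABLE
MINUS
SELECT FILENAME
FROM MYSCHEMA.LOADED_FILES;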
0 votes

Short Run:

Your approach

You've already found a viable solution, and your concern about the reliability of LAST_QUERY_ID() is understandable. A stored procedure's scope is isolated, so LAST_QUERY_ID() called inside a procedure only sees the statements executed within that procedure. A procedure might be unnecessary, but I personally like that they let you create reusable abstractions. A minimal sketch follows.
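
For illustration only: a sketch of such a procedure and a task to run it on a schedule. The procedure, task, and warehouse names are placeholders I've invented, and error handling is omitted.

CREATE OR REPLACE PROCEDURE LIST_STAGE_FILES(STAGE_NAME STRING)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS
$$
  // Run LIST, then read its result back. Inside the procedure,
  // LAST_QUERY_ID() can only refer to statements this procedure executed.
  snowflake.execute({ sqlText: "LIST @" + STAGE_NAME });
  var rs = snowflake.execute({
    sqlText: "SELECT \"name\" FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))"
  });
  var names = [];
  while (rs.next()) {
    names.push(rs.getColumnValue(1));
  }
  return names;
$$;

-- Run it on a schedule (MY_WH is a placeholder warehouse)
CREATE OR REPLACE TASK DETECT_NEW_FILES
  WAREHOUSE = MY_WH
  SCHEDULE = '15 MINUTE'
AS
  CALL LIST_STAGE_FILES('MY_STAGE');

-- Tasks are created suspended; remember to resume:
-- ALTER TASK DETECT_NEW_FILES RESUME;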

Another approach

An alternative, if you don't like the approach you're using, would be to create a single table with a single VARIANT data column plus the stage metadata columns, maintained by a single giant pipe. Over that table you could maintain a set of materialized views which filter, convert VARIANT fields to columns, and sanitize, as appropriate.
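
A rough sketch of that layout, assuming JSON data; the stage, table, view, and field names (MY_STAGE, RAW_LANDING, the orders/ prefix, order_id, amount) are all invented for illustration:

-- One landing table: raw VARIANT data plus stage metadata
CREATE TABLE RAW_LANDING (
  V        VARIANT,
  FILENAME STRING
);

-- One pipe feeding it from the whole stage
CREATE PIPE RAW_LANDING_PIPE AUTO_INGEST = TRUE AS
  COPY INTO RAW_LANDING
  FROM (
    SELECT $1, METADATA$FILENAME
    FROM @MY_STAGE
  )
  FILE_FORMAT = (TYPE = JSON);

-- One materialized view per source/prefix: filter rows, lift fields to columns
CREATE MATERIALIZED VIEW ORDERS AS
  SELECT
    V:order_id::NUMBER      AS order_id,
    V:amount::NUMBER(10, 2) AS amount,
    FILENAME
  FROM RAW_LANDING
  WHERE FILENAME LIKE 'orders/%';

Note that the single pipe still relies on the S3 event notifications you mentioned, but only one pipe serves the whole stage; each new data source then needs only a new view.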

There are some benefits:

  • simpler: integrating new prefixes for a stage requires only an additional materialized view, not an additional pipe + task
  • more control: you'd be able to operate directly and automatically on the data in raw form, rather than needing to load it into a table first and then check it. This means you can perform data quality checks, metadata checks, and sanitization.
  • maintainable: the use of materialized views over an immutable source means you can at any time change the logic and perform a full backfill with little effort.

Long Run:

Notification integrations enable Snowflake to listen (and possibly notify in the future, roadmap-gods willing) to external messaging systems. At the moment only Azure is supported, so it won't work for your case, but keep an eye out over the next few months -- I think it's safe to speculate that this feature will grow to support AWS, and that a more direct and concise way to implement your original solution will eventually become available.