
I'm expecting to stream 10,000 small (~10 KB) files per day into Snowflake via S3, distributed evenly throughout the day. I plan on using S3 event notifications, as outlined in the Snowpipe documentation, to automate ingestion. I also want to persist these files on S3 independently of Snowflake. I have two choices for how to ingest from S3:

s3://data-lake/2020-06-02/objects
              /2020-06-03/objects
                 .
                 .
              /2020-06-24/objects

or

s3://snowpipe-specific-bucket/objects

From a best-practices / billing perspective, should I ingest directly from my data lake, meaning my 'CREATE OR REPLACE STORAGE INTEGRATION' and 'CREATE OR REPLACE STAGE' statements reference the top-level 's3://data-lake' above? Or should I create a dedicated S3 bucket for the Snowpipe ingestion and expire the objects in that bucket after a day or two?
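
For concreteness, the two options would look roughly like this (integration, stage, and role names here are placeholders, not my actual setup, and I'm assuming JSON files just for illustration):

-- Shared storage integration; STORAGE_ALLOWED_LOCATIONS would list whichever bucket(s) are used
CREATE OR REPLACE STORAGE INTEGRATION my_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'   -- placeholder ARN
  STORAGE_ALLOWED_LOCATIONS = ('s3://data-lake/', 's3://snowpipe-specific-bucket/');

-- Option A: stage over the top-level data lake bucket
CREATE OR REPLACE STAGE data_lake_stage
  URL = 's3://data-lake/'
  STORAGE_INTEGRATION = my_s3_int
  FILE_FORMAT = (TYPE = JSON);   -- assumed format

-- Option B: stage over a dedicated ingestion bucket whose objects expire after a day or two
CREATE OR REPLACE STAGE snowpipe_stage
  URL = 's3://snowpipe-specific-bucket/'
  STORAGE_INTEGRATION = my_s3_int
  FILE_FORMAT = (TYPE = JSON);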

Does Snowpipe have to do more work (and hence bill me more) to ingest if I give it a top-level folder that has thousands and thousands of objects in it, than if I give it a small, tightly controlled, dedicated folder with only a few objects in it? Does the S3 event notification tell Snowpipe exactly which objects are new, or does Snowpipe have to do a LIST and compare it against the list of objects already ingested?

Documentation at https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html doesn't offer up any specific guidance in this case.

Automation of Snowpipe with SQS looks like the best fit for your requirement: docs.snowflake.com/en/user-guide/… – Ankur Srivastava
I have no conflicting notifications on the S3 bucket, and I do not need to inform any other task of the S3 change. Those are the two documented differences between the two approaches. Is the notification itself different between the two approaches? If so, it is not documented. – jo5729

1 Answer


The INTEGRATION receives a message from AWS whenever a new file is added. If that file matches the file format, file path, etc. of your STAGE, then the COPY INTO statement from your pipe is run on that file.
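
As a rough sketch of that filtering (table, stage, and pipe names are placeholders, not from an actual setup):

-- Only files that land under the stage's URL/path and match its file format are loaded;
-- anything else in the bucket is ignored by this pipe.
CREATE OR REPLACE PIPE my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO my_table
  FROM @snowpipe_stage
  FILE_FORMAT = (TYPE = JSON);   -- assumed format

-- SHOW PIPES returns the notification_channel (an SQS queue ARN) that the S3 event
-- notification should point at for AUTO_INGEST to work.
SHOW PIPES;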

There is minimal overhead for the integration to receive extra messages that do not match your STAGE filters, and no overhead that I know of for other files in that source.

So I am fairly certain that this will work fine either way as long as your STAGE is set up correctly.

We have been running a similar setup for the last six months: roughly 5,000 permanent files per day into a single Azure storage account, with the files divided into directories that correspond to different Snowflake STAGEs, and we have seen no noticeable extra lag on the copying.