I'm expecting to stream 10,000 (small, ~ 10KB) files per day into Snowflake via S3, distributed evenly throughout the day. I plan on using the S3 event notification as outlined in the Snowpipe documentation to automate. I also want to persist these files on S3 independent of Snowflake. I have two choices on how to ingest from S3:
s3://data-lake/2020-06-02/objects
/2020-06-03/objects
.
.
/2020-06-24/objects
or
s3://snowpipe specific bucket/objects
From a best practices / billing perspective, should I ingest directly from my data lake - meaning my 'CREATE or replace STORAGE INTEGRATION' and 'CREATE or replace STAGE' statements references top level 's3://data-lake' above? Or, should I create a dedicated S3 bucket for the Snowpipe ingestion, and expire the objects in that bucket after a day or two?
Does Snowpipe have to do more work (and hence bill me more) to ingest if I give it a top level folder that has thousands and thousand and thousands of objects in it, than if I give it a small tight, controlled, dedicated folder with only a few objects in it? Does the S3 notification service tell Snowpipe what is new when the notification goes out, or does Snowpipe have to do a LIST and compare it to the list of objects already ingested?
Documentation at https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html doesn't offer up any specific guidance in this case.