In s3 bucket daily new JSON files are dumping , i have to create solution which pick the latest file when it arrives PARSE the JSON and load it to Snowflake Datawarehouse. may someone please share your thoughts how can we achieve
4 Answers
There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create a SQS message when the file lands on s3 and have an ec2 instance poll the queue and process as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
edit: Here is a more detailed explanation on how to create events from s3 and trigger lambda functions. Documentation is provided by Snowflake
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
There are some aspects to be considered such as is it a batch or streaming data , do you want retry loading the file in case there is wrong data or format or do you want to make it a generic process to be able to handle different file formats/ file types(csv/json) and stages. In our case we have built a generic s3 to Snowflake load using Python and Luigi and also implemented the same using SSIS but for csv/txt file only.
In my case, I have a python script which get information about the bucket with boto.
Once I detect a change, I call the REST Endpoint Insertfiles on SnowPipe.
Phasing:
- detect S3 change
- get S3 object path
- parse Content and transform to CSV in S3 (same bucket or other snowpipe can connect)
- Call SnowPipe REST API
What you need:
- Create a user with a public key
- Create your stage on SnowFlake with AWS credential in order to access S3
- Create your pipe on Snowflake with your user role
- Sign a JWT
I also tried with a Talend job with TOS BigData.
Hope it helps.