I am trying to build a pipeline that sends data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to data engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are as follows:
- I am looking to schedule a monthly job. Should I define the schedule on the AWS side or the Snowflake side?
- For the initial pull, I want to query 12 months' worth of data from Snowflake. For any subsequent pull, I only need the last month, since the pipeline runs monthly (first sketch after this list).
- Each monthly pull should be stored in its own S3 subfolder, named like this: query_01012020, query_01022020, query_01032020, etc.
- The load from S3 back into a specified Snowflake table should be triggered only after the ML model has successfully scored the data in SageMaker (second sketch below).
- I want to monitor the performance of the ML model in production over time, to catch any degradation in accuracy (some calibration-like graph, perhaps; third sketch below).
- I want real-time error notifications whenever something in the pipeline fails (fourth sketch below).
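To make the requirements concrete, here is a rough sketch of the monthly unload step as I currently imagine it (Python, since I would probably drive this from a script). Every name here is a placeholder I invented: the external stage ml_unload_stage, the analytics.events table, and the event_date column. I am also assuming the job runs on the first of each month so the folder suffix matches the query_DDMMYYYY pattern above.

```python
"""Monthly Snowflake -> S3 unload (sketch only; all names are placeholders)."""
from datetime import date

import snowflake.connector  # pip install snowflake-connector-python


def unload_month(conn, first_run: bool) -> None:
    # One subfolder per run, e.g. query_01022020 (assumes a run on the 1st).
    prefix = f"query_{date.today().strftime('%d%m%Y')}"
    # 12 months of history on the very first pull, 1 month afterwards.
    months_back = 12 if first_run else 1
    conn.cursor().execute(f"""
        COPY INTO @ml_unload_stage/{prefix}/
        FROM (
            SELECT *
            FROM analytics.events                                  -- placeholder table
            WHERE event_date >= DATEADD(month, -{months_back}, CURRENT_DATE)
        )
        FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
        HEADER = TRUE
        OVERWRITE = TRUE
    """)


if __name__ == "__main__":
    conn = snowflake.connector.connect(
        account="my_account",        # placeholder credentials
        user="pipeline_user",
        password="change_me",
        warehouse="pipeline_wh",
        database="analytics_db",
    )
    try:
        unload_month(conn, first_run=False)
    finally:
        conn.close()
```

From what I have read, this could be scheduled either as a Snowflake TASK with a USING CRON schedule (if rewritten as pure SQL) or from the AWS side, e.g. an EventBridge rule on a monthly cron triggering a Lambda that runs this script; which side is the better home for the schedule is exactly the part I am unsure about.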
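For the S3 -> Snowflake return leg, my current idea is that the SageMaker batch transform job writes scored files under a scored/ prefix, and an S3 event notification invokes a small Lambda that copies each new file into the target table. The bucket layout, the ml_scored_stage stage, and the analytics.scored_events table are all my own assumptions; I gather Snowpipe with AUTO_INGEST is the more managed way to get the same trigger, so that may be the better answer.

```python
"""Sketch of a Lambda triggered by S3 put events on the scored/ prefix.
All Snowflake object names are placeholders."""
import snowflake.connector


def _connect():
    # Same placeholder credentials as in the unload script.
    return snowflake.connector.connect(
        account="my_account",
        user="pipeline_user",
        password="change_me",
        warehouse="pipeline_wh",
        database="analytics_db",
    )


def handler(event, context):
    # Standard S3 event notification payload: one record per new object.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]  # e.g. scored/query_01022020/part-0.csv
        conn = _connect()
        try:
            # ml_scored_stage is assumed to point at the bucket root,
            # so the stage path plus the object key line up.
            conn.cursor().execute(f"""
                COPY INTO analytics.scored_events        -- placeholder target table
                FROM @ml_scored_stage/{key}
                FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            """)
        finally:
            conn.close()
```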
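For the monitoring requirement, my rough plan is: once the true outcomes for a month become available, join them back onto that month's SageMaker scores and compute a calibration curve plus a summary metric to track over time. The y_true and y_prob column names are made up for this sketch.

```python
"""Sketch of monthly model monitoring (column names are assumptions)."""
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss


def monthly_calibration(df: pd.DataFrame):
    # df: one row per scored record, with the realized label ('y_true', 0/1)
    # joined back onto the model's predicted probability ('y_prob').
    prob_true, prob_pred = calibration_curve(df["y_true"], df["y_prob"], n_bins=10)
    brier = brier_score_loss(df["y_true"], df["y_prob"])
    # Plot prob_pred vs. prob_true for the calibration graph; a well-calibrated
    # model sits on the diagonal. Track the Brier score per month and alert
    # when it drifts upward.
    return prob_true, prob_pred, brier
```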
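For the notification requirement, I was thinking of wrapping every pipeline step so that any exception is published to an SNS topic with an email (or Slack) subscription; the topic ARN below is a placeholder.

```python
"""Sketch of real-time failure alerts via SNS (topic ARN is a placeholder)."""
import traceback

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"  # placeholder


def run_with_alerts(step_name, fn, *args, **kwargs):
    """Run one pipeline step; on failure, publish the traceback and re-raise."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject=f"Pipeline step failed: {step_name}",
            Message=traceback.format_exc(),
        )
        raise
```

Usage would be something like run_with_alerts("unload", unload_month, conn, first_run=False) around each step.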
I hope you can point me to relevant documentation or tutorials for this effort, and correct any of the assumptions in my sketches above. I would truly appreciate the guidance.
Thank you very much.