We have a service that generates about 3,000 files per minute, each under 5 KB. These are stored in Azure Blob Storage. We need to concatenate these files and send them to S3, where each final file should be between 10 MB and 100 MB (this data is loaded into Snowflake through Snowpipe). How can this be achieved in a fast and cost-effective way?
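For context, a rough back-of-the-envelope on batch sizing from the numbers above (the ~5 KB average is an assumption, since the files are only known to be under 5 KB):

```python
# Rough sizing: how many minutes of input make one 10-100 MB output file.
files_per_minute = 3000
avg_file_kb = 5                          # assumed average; actual files are "< 5 KB"
mb_per_minute = files_per_minute * avg_file_kb / 1024
print(f"~{mb_per_minute:.1f} MB of new data per minute")            # ~14.6 MB/min
print(f"batch {10 / mb_per_minute:.1f}-{100 / mb_per_minute:.1f} "
      f"minutes of blobs per output file")                           # ~0.7-6.8 min
```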
Adding more information about what I have already tried:
1) Sending a blob-created event to an Azure queue, using a queue-trigger function to copy each blob to S3, and then using an AWS Lambda to concatenate the files (but the Lambda usually times out).
2) Python code that uses multiprocessing to read the Azure queue and blobs, concatenate the data into ~10 MB files, and send them to S3 (see the sketch after this list). I tried running this code from an Azure WebJob, but the WebJob only has 4 cores, so it is not fast enough and does not scale.
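For reference, a minimal single-process sketch of what approach 2 looks like (the real code fans this out across worker processes). It assumes azure-storage-queue, azure-storage-blob, and boto3; the connection string, queue, container, bucket, and message format are placeholders, and retry/poison-message handling is omitted:

```python
# Sketch of approach 2: drain blob names from an Azure queue, download the blobs,
# concatenate them until the buffer reaches ~10 MB, then upload the batch to S3.
import io
import time
import uuid

import boto3
from azure.storage.blob import ContainerClient
from azure.storage.queue import QueueClient

AZURE_CONN = "<azure-storage-connection-string>"   # placeholder
QUEUE_NAME = "blob-created-events"                 # placeholder
CONTAINER = "incoming-files"                       # placeholder
S3_BUCKET = "snowpipe-landing"                     # placeholder
TARGET_BYTES = 10 * 1024 * 1024                    # ~10 MB per output file

queue = QueueClient.from_connection_string(AZURE_CONN, QUEUE_NAME)
container = ContainerClient.from_connection_string(AZURE_CONN, CONTAINER)
s3 = boto3.client("s3")

buffer = io.BytesIO()
pending = []  # queue messages to delete once their data is safely in S3

for msg in queue.receive_messages(visibility_timeout=300):
    blob_name = msg.content                # assumes the message body is the blob name
    buffer.write(container.download_blob(blob_name).readall())
    pending.append(msg)

    if buffer.tell() >= TARGET_BYTES:
        key = f"batch/{int(time.time())}-{uuid.uuid4().hex}.dat"
        s3.put_object(Bucket=S3_BUCKET, Key=key, Body=buffer.getvalue())
        for m in pending:
            queue.delete_message(m)        # only delete after a successful upload
        buffer = io.BytesIO()
        pending = []
```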
I need a solution that can run tasks in parallel in the most cost-effective way and is scalable. It can be a batch process; the data can take up to 24 hours to land in S3. (We cannot use Azure Batch, as we have already exhausted the number of Batch accounts allowed under our subscription plan for another process.)
Any recommendations for ETL tools or services that would be best suited for this case?