I am fairly new to Azure, and my task is to use any Azure service (or several services working together) to download a million files in parallel from a third-party REST API endpoint, which returns one file at a time, into Blob Storage using Azure Data Factory.
WHAT I RESEARCHED :
From what I researched, my task boils down to the following requirements:
- Running millions of downloads in parallel: for this, Azure Batch seemed like a good fit, since it runs large numbers of tasks in parallel on pools of VMs (the same model used for graphics rendering or machine-learning workloads).
- Saving the REST API response to Blob Storage: Azure Data Factory can handle this kind of ETL operation in a source/sink style, with the REST API as the source and Blob Storage as the sink.
WHAT I HAVE TRIED:
Here are some things to note:
- I added the REST API and Blob as linked services.
- The API endpoint takes a query string parameter named fileName.
- I am passing the whole URL, including the query string.
- The REST API is protected by a bearer token, which I am trying to pass via additional headers (a rough sketch of these definitions follows this list).
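For reference, the REST linked service and dataset I am working with look roughly like the sketch below (the endpoint URL, names, and file name are placeholders I made up for this post, and at the moment I actually pass the entire URL, query string included, rather than splitting it into base URL and relative URL as shown). The linked service comes first, then the dataset; the bearer token itself goes under Additional headers on the copy activity's REST source:

```json
{
  "name": "ThirdPartyRestService",
  "properties": {
    "type": "RestService",
    "typeProperties": {
      "url": "https://thirdparty.example.com/api/",
      "enableServerCertificateValidation": true,
      "authenticationType": "Anonymous"
    }
  }
}

{
  "name": "RestFileDataset",
  "properties": {
    "type": "RestResource",
    "linkedServiceName": {
      "referenceName": "ThirdPartyRestService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "relativeUrl": "download?fileName=sample-file-001.pdf"
    }
  }
}
```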
THE MAIN PROBLEM:
- When I publish the pipeline I get the error message "model is not appropriate", just that one line, with no insight into what is actually wrong.
OTHER QUERIES:
- Is it possible to pass query string values dynamically from a SQL table, so that each file name comes from a single-column result set returned by a stored procedure or an inline query? (The first sketch after this list shows what I'm picturing.)
- Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can the two services be integrated?
- Is it possible to achieve the million parallel downloads without Data Factory, using only Azure Batch? (The second sketch below shows the kind of worker I have in mind.)
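For the SQL-driven file names, what I am picturing is a Lookup activity that runs the query or stored procedure, feeding a ForEach that calls a parameterized copy for each row. This is an untested sketch; the pipeline, dataset, table, and column names are all made up, and it assumes the REST dataset above is given a string parameter fileName whose relative URL becomes an expression like @concat('download?fileName=', dataset().fileName):

```json
{
  "name": "DownloadAllFiles",
  "properties": {
    "activities": [
      {
        "name": "LookupFileNames",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT FileName FROM dbo.FilesToDownload"
          },
          "dataset": {
            "referenceName": "SqlFileListDataset",
            "type": "DatasetReference"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachFileName",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "LookupFileNames", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupFileNames').output.value",
            "type": "Expression"
          },
          "isSequential": false,
          "batchCount": 50,
          "activities": [
            {
              "name": "CopyOneFile",
              "type": "Copy",
              "inputs": [
                {
                  "referenceName": "RestFileDataset",
                  "type": "DatasetReference",
                  "parameters": { "fileName": "@item().FileName" }
                }
              ],
              "outputs": [
                { "referenceName": "BlobSinkDataset", "type": "DatasetReference" }
              ],
              "typeProperties": {
                "source": { "type": "RestSource" },
                "sink": { "type": "BlobSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

My understanding is that ForEach caps batchCount at 50 concurrent iterations, so a million files would still be processed in chunks rather than truly all at once, which is part of why I am asking about Azure Batch.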
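For the Batch-only route, the kind of per-task worker script I have in mind would take one or more file names on its command line, call the API, and write each response straight to Blob Storage. A rough Python sketch, where the environment variable names, container name, and endpoint are my own placeholders:

```python
import os
import sys

import requests
from azure.storage.blob import BlobServiceClient

# Placeholders: the real endpoint, token, and connection string would be
# supplied through Batch task environment settings, not hard-coded.
API_URL = os.environ["THIRD_PARTY_API_URL"]        # e.g. https://thirdparty.example.com/api/download
API_TOKEN = os.environ["THIRD_PARTY_API_TOKEN"]    # bearer token for the REST API
STORAGE_CONN = os.environ["STORAGE_CONNECTION_STRING"]


def download_one(container_client, file_name: str) -> None:
    """Fetch a single file from the REST API and upload it to Blob Storage."""
    resp = requests.get(
        API_URL,
        params={"fileName": file_name},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=120,
    )
    resp.raise_for_status()
    container_client.upload_blob(name=file_name, data=resp.content, overwrite=True)


def main() -> None:
    # Each Batch task would be started as: python worker.py <fileName> [<fileName> ...]
    blob_service = BlobServiceClient.from_connection_string(STORAGE_CONN)
    container = blob_service.get_container_client("downloaded-files")
    for file_name in sys.argv[1:]:
        download_one(container, file_name)


if __name__ == "__main__":
    main()
```

The idea would be to add one Batch task per file (or per small batch of files) to a job on a pool sized for the concurrency I need, but I am not sure whether this is the intended way to combine Batch with, or use it instead of, Data Factory.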