
I am fairly new to Azure, and I have been tasked with using an Azure service (or several Azure services integrated together) to download a million files in parallel from a third-party REST API endpoint that returns one file at a time, and to land them in Blob Storage using Azure Data Factory.

WHAT I RESEARCHED :

From what I researched, my task boiled down to two requirements:

  • Parallel runs in the millions: for this I deduced that Azure Batch would be a good option, since it can run a large number of tasks in parallel on VMs (the same model it uses for graphics rendering and machine-learning workloads).
  • Saving the REST API response to Blob Storage: I found that Azure Data Factory can handle this kind of ETL operation in a source/sink fashion, with the REST API set as the source and Blob Storage as the sink.

WHAT I HAVE TRIED:

Here are some things to note:

  • I added the REST API and Blob Storage as linked services.
  • The API endpoint takes a query-string parameter named fileName.
  • I am passing the whole URL, including the query string.
  • The REST API is protected by a bearer token, which I am trying to pass using additional headers (a simplified sketch follows this list).
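
For context, here is a simplified sketch of the copy activity I am working with; everything in it (the names, the token, the sink type) is a placeholder rather than my exact definition:

```json
{
    "name": "CopySingleFile",
    "type": "Copy",
    "inputs": [ { "referenceName": "RestFileDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "BlobFileDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "RestSource",
            "requestMethod": "GET",
            "additionalHeaders": "Authorization: Bearer <token-placeholder>"
        },
        "sink": {
            "type": "BlobSink"
        }
    }
}
```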

THE MAIN PROBLEM:

  • When I publish the pipeline, I get an error message saying the model is not appropriate; just that one line, with no insight into what is wrong.

OTHER QUERIES:

  • Is it possible to pass query-string values dynamically from a SQL table, so that each fileName is picked from a row of a single-column result set returned by a stored procedure or inline query?
  • Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can the two be integrated?
  • Is it possible to achieve a million parallel downloads without Data Factory, using only Batch?

1 Answer


It is hard to help with your main problem; you would need to provide more of your pipeline code.

In relation to your other queries:

  • You can use a Lookup activity to fetch the list of file names from a database (via either a stored procedure or an inline query). The next step would be a ForEach activity that iterates over that array and copies each file from the REST endpoint to the storage account. You can adjust the parallelism through the ForEach activity's batchCount setting to match your requirement, but around 20 concurrent executions is what you normally see (see the pipeline sketch below).

  • Using Azure Batch just to download a file seems a bit of an overkill, as it should be a fairly quick operation. If you want to see an example of an Azure Batch job written in C#, I can recommend this sample: https://github.com/Azure-Samples/batch-dotnet-quickstart/blob/master/BatchDotnetQuickstart (a minimal sketch in the same spirit is shown below). In terms of parallelism, I think you will manage to achieve a higher degree on Azure Batch than on Azure Data Factory.

  • If you need to actually download 1M files in parallel, I don't think you have any option other than Azure Batch to get close to such numbers. But you must have a pretty beefy API if it can handle 1M requests within a second or two.
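
To illustrate the first point, here is a trimmed pipeline sketch. It assumes a stored procedure dbo.GetFileNames returning a single FileName column, and datasets named FileNameTable, RestFileDataset and BlobFileDataset; all of these are placeholders, not your actual setup. The loop passes @item().FileName into the REST dataset through a dataset parameter, which the dataset can then use in its relative URL:

```json
{
    "name": "DownloadFilesPipeline",
    "properties": {
        "activities": [
            {
                "name": "LookupFileNames",
                "type": "Lookup",
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderStoredProcedureName": "dbo.GetFileNames"
                    },
                    "dataset": { "referenceName": "FileNameTable", "type": "DatasetReference" },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "dependsOn": [
                    { "activity": "LookupFileNames", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "items": {
                        "value": "@activity('LookupFileNames').output.value",
                        "type": "Expression"
                    },
                    "isSequential": false,
                    "batchCount": 20,
                    "activities": [
                        {
                            "name": "CopyOneFile",
                            "type": "Copy",
                            "inputs": [
                                {
                                    "referenceName": "RestFileDataset",
                                    "type": "DatasetReference",
                                    "parameters": { "fileName": "@item().FileName" }
                                }
                            ],
                            "outputs": [
                                { "referenceName": "BlobFileDataset", "type": "DatasetReference" }
                            ],
                            "typeProperties": {
                                "source": { "type": "RestSource" },
                                "sink": { "type": "BlobSink" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}
```

One caveat at your scale: the Lookup activity returns at most 5,000 rows, so for a million files you would need to partition the list across multiple lookups or pipeline runs.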
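
And to illustrate the Batch route, here is a minimal C# sketch along the lines of the quickstart linked above. It assumes a pool and a job (here "DownloadJob") already exist, and downloader.exe (a small program on the pool nodes that would call the API and upload the result to blob storage) is a placeholder, as is GetFileNames():

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class DownloadSubmitter
{
    static void Main()
    {
        // Placeholder credentials: substitute your Batch account URL, name and key.
        var credentials = new BatchSharedKeyCredentials(
            "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // One CloudTask per file; GetFileNames() stands in for however you
            // produce the list (e.g. the SQL table mentioned in the question).
            var tasks = new List<CloudTask>();
            int i = 0;
            foreach (string fileName in GetFileNames())
            {
                // Each task runs on a pool node and is responsible for calling
                // the REST endpoint and writing the file to blob storage.
                string commandLine = $"cmd /c downloader.exe --file \"{fileName}\"";
                tasks.Add(new CloudTask($"download-{i++}", commandLine));
            }

            // AddTask submits the collection to the job; Batch then schedules
            // the tasks across the pool's nodes in parallel.
            batchClient.JobOperations.AddTask("DownloadJob", tasks);
        }
    }

    static IEnumerable<string> GetFileNames()
    {
        yield return "example-file-001"; // placeholder
    }
}
```

The degree of parallelism you actually get is then governed by the size of the pool and the tasks-per-node setting, not by the pipeline, which is why Batch can scale well past what a ForEach loop offers.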