0
votes

Consider a data processing pipeline as follows:

  1. Fetch a large amount of data from a REST API that's hosted somewhere on the internet and persist it to a data store.
  2. Perform some complex data transformations on the persisted data.
  3. Persist the results of the data transformations to a data store.

For implementing such a pipeline in Azure, steps 2 and 3 seem like a good fit for Azure Data Factory activities.

My question is: does it make sense to implement step 1 as an Azure Data Factory activity as well?

Technically, it might be possible to code a .NET activity that performs the data download and persistence.


3 Answers

1
votes

No - do not implement step 1 in an Azure Data Factory activity.

Technically, it is possible to run the entire process from ADF, but I would argue that this choice is relatively more costly than the other options available to you, because you pay for each activity in Azure Data Factory.

For instance, what if the REST API does not have any new data to offer when you initiate the (scheduled) activity? You'll pay for that.

You might consider the following as an easy-to-implement alternative:

  1. Create a .NET console app, publish it as a WebJob, and schedule it to run daily.
  2. The long-running console app queries the REST API, persists the data to Azure Storage or DocumentDB, and pushes a message onto a queue, which triggers ADF steps 2 and 3 to run against the saved data.
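
The flow of that alternative could be sketched roughly as follows. This is only an illustration in Python rather than .NET, and the names `run_ingest`, `fetch`, `persist`, and `notify` are hypothetical placeholders, not part of any Azure SDK. In a real job they would wrap the HTTP call, the storage upload, and the queue send respectively.

```python
import json
from typing import Callable, Optional


def run_ingest(
    fetch: Callable[[], Optional[bytes]],
    persist: Callable[[bytes], str],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Fetch, persist, then signal downstream; do nothing if there is no new data.

    All three callables are hypothetical stand-ins supplied by the caller:
      fetch   -> returns the raw API payload, or None/empty when nothing is new
      persist -> stores the payload and returns where it was stored
      notify  -> enqueues a message that the downstream pipeline reacts to
    """
    payload = fetch()
    if not payload:
        # Nothing new from the API: skip persistence and do not trigger ADF.
        return None

    location = persist(payload)

    # The message carries only a pointer to the saved data, not the data itself.
    notify(json.dumps({"location": location}))
    return location
```

The point of the early return is the cost concern raised above: when the API has nothing new, no message is queued, so the ADF steps are never started.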

1
votes

I have done exactly that using a .NET activity. I needed to fetch data from the Salesforce API, and this has been working well for my needs. Here is a post I wrote up about creating a .NET activity and storing the data in Azure Data Lake.

As Newport99's answer says, you will incur costs for that activity, but I am not sure how cost-effective it would be to run a separate web app to host a WebJob and also run the Azure Data Factory pipeline. When I was originally designing a solution, the WebJob was my first choice, but in the end I preferred to have the whole solution use one Azure service instead of several.

Hope that helps.

1
votes

There have been a lot of improvements to ADF in the years since this question was posted, including a REST connector. Here's the approach recommended by ADF at this time...

Copy data from a REST endpoint by using Azure Data Factory