0
votes

I'm trying to figure out a solution for recurring data aggregation of several thousand remote XML and JSON data files, by using Azure queues and WebJobs to fetch the data.

Basically, an input endpoint URL of some sort would be called (with a data URL as parameter) on an Azure website/app. It should trigger a WebJobs background job (or can it continuously running and checking the queue periodically for new work), fetch the data URL and then callback an external endpoint URL on completion.

Now the main concern is the volume and its performance/scaling/pricing overhead. There will be around 10,000 URLs to be fetched every 10-60 minutes (most URLs will be fetched once every 60 minutes). With regards to this scenario of recurring high-volume background jobs, I have a couple of questions:

  1. Is Azure WebJobs (or Workers?) the right option for background processing at this volume, and be able to scale accordingly?

  2. For this sort of volume, which Azure website tier will be most suitable (comparison at http://azure.microsoft.com/en-us/pricing/details/app-service/)? Or would only a Cloud or VM(s) work at this scale?

Any suggestions or tips are appreciated.

1

1 Answers

1
votes
  1. Yes, Azure WebJobs is an ideal solution to this. Azure WebJobs will scale with your Web App (formerly Websites). So, if you increase your web app instances, you will also increase your web job instances. There are ways to prevent this but that's the default behavior. You could also setup autoscale to automatically scale your web app based on CPU or other performance rules you specify.
    It is also possible to scale your web job independently of your web front end (WFE) by deploying the web job to a web app separate from the web app where your WFE is deployed. This has the benefit of not taking up machine resources (CPU, RAM) that your WFE is using while giving you flexibility to scale your web job instances to the appropriate level. Not saying this is what you should do. You will have to do some load testing to determine if this strategy is right (or necessary) for your situation.

  2. You should consider at least the Basic tier for your web app. That would allow you to scale out to 3 instances if you needed to and also removes the CPU and Network I/O limits that the Free and Shared plans have.

As for the queue, I would definitely suggest using the WebJobs SDK and let the JobHost (from the SDK) invoke your web job function for you instead of polling the queue. This is a really slick solution and frees you from having to write the infrastructure code to retrieve messages from the queue, manage message visibility, delete the message, etc. For a working example of this and a quick start on building your web job like this, take a look at the sample code the Azure WebJobs SDK Queues template punches out for you.

enter image description here