5 votes

I'm currently trying to divide a processor-intensive simulation task into a few hundred chunks that are processed in parallel within Azure. I thought that Azure WebSites, which offer an easy-to-set-up dedicated virtual machine, and WebJobs, with their easy-to-use abstraction over a Storage Queue, would fit the bill perfectly.

I have the following Azure setup, which gets freshly created by my code each time I run it:

  • A single storage account
  • One storage queue with job descriptions
  • A storage container with static data
  • A storage container for the results (unique files per job)
  • n (for example 8) "Standard" WebSites, meaning there are n different *.azurewebsites.net URIs
  • One WebJob on each WebSite running continuously (8 WebJobs in the example) using the WebJobs SDK (JobHost); a minimal host-startup sketch follows this list
  • Each job description is <1k
  • Each job consists of about 100k of blob input data
  • Each result is about 100k of blob output data
  • With the current scaling, each job runs for about one and a half minutes
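
Each of those continuous WebJobs starts the host in the usual way. A minimal sketch, assuming the WebJobs SDK 1.x namespace and the standard storage connection strings in the generated *.config:

using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        // Continuous WebJob entry point: the JobHost polls "simulationjobs"
        // and dispatches each message to RunGeant4Simulation below.
        // Connection strings come from the *.config generated per WebSite.
        var host = new JobHost();
        host.RunAndBlock();
    }
}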

Here is the signature of the job.

public static void RunGeant4Simulation(
    [QueueTrigger("simulationjobs")] JobDescription jobDescription,
    [Blob("input/{Archive}", FileAccess.Read)] Stream archive,
    [Blob("result/{Name}-{Energy}-output.zip", FileAccess.Write)] Stream output,
    [Blob("result/{Name}-{Energy}-log.dat")] TextWriter debug
)

The code then goes ahead to set up a WebSite-local, job-specific directory, extracts the zip archive containing an executable, runs this executable with Process.Start, and writes the captured output to the blob. Everything the process accesses is available on the machine. The debug TextWriter is for capturing timing information within the job.
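
In outline, that setup step looks roughly like this (a sketch only; the temp-directory layout and the executable name "simulation.exe" are placeholders, not the real names):

// Sketch of the per-job setup; paths and the executable name are illustrative.
var workingDirectory = Path.Combine(Path.GetTempPath(),
    jobDescription.Name + "-" + jobDescription.Energy);
Directory.CreateDirectory(workingDirectory);

// Extract the executable and its data from the input blob.
using (var zip = new ZipArchive(archive, ZipArchiveMode.Read))
{
    zip.ExtractToDirectory(workingDirectory);
}

var processStartInfo = new ProcessStartInfo
{
    FileName = Path.Combine(workingDirectory, "simulation.exe"),
    WorkingDirectory = workingDirectory,
    UseShellExecute = false,
    RedirectStandardOutput = true
};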

What I expected to see was that each WebSite would take a job from the queue, run it, post the results into the container and take the next job.

What I'm actually seeing is that only a single WebSite actually runs jobs while the remaining ones just idle, although the WebJob is reported as started and running on each site. The net result is the same number of jobs finished per minute as with one WebSite. Here is a log of a run where two WebSites "decided" to participate in running jobs: simulation-log.zip. The storage account mentioned in the connection strings has already been deleted, so I did not remove the access keys from the logs.

I have added some timing instrumentation to the WebJob, and from that I can see that running the executable sometimes takes (pretty much exactly) two or three times as long as it would in a "normal" run:

stopwatch.Start();
using (var process = Process.Start(processStartInfo))
{
    debug.WriteLine("After Starting Process: {0}", DateTime.UtcNow);

    // Read stdout to the end before WaitForExit to avoid deadlocking
    // on a full output pipe.
    var outputData = process.StandardOutput.ReadToEnd();

    process.WaitForExit();

    stopwatch.Stop();
    debug.WriteLine("Process Finished: {0} {1}", DateTime.UtcNow, stopwatch.Elapsed);

    outputBytes = Encoding.UTF8.GetBytes(outputData);
}

The stopwatch shows times of 1:15, 2:27, 3:43, and so on. Some of the jobs that take longer than expected still report an expected stopwatch time, however. In both cases, jobs run on another WebSite instead, and results show up in the storage's result container. In the end, the number of jobs finished per minute does not change.

Update

Today, I went one step further: I created a separate storage account per WebSite and manually distributed the jobs across 8 queues in 8 storage accounts, one for each of the 8 WebSites. From my outside perspective, the WebSites then had nothing in common apart from incidentally running the same code.
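
The manual distribution is plain round-robin over the per-account queue clients. A sketch, assuming the Azure Storage client library and JSON-serialized job descriptions (which is what the WebJobs SDK expects for a POCO QueueTrigger); the connection strings and collections are placeholders:

// One queue per storage account/WebSite; connection strings are placeholders.
var queues = connectionStrings
    .Select(cs => CloudStorageAccount.Parse(cs)
        .CreateCloudQueueClient()
        .GetQueueReference("simulationjobs"))
    .ToArray();

foreach (var queue in queues)
{
    queue.CreateIfNotExists();
}

// Round-robin the job descriptions across the 8 queues.
for (var i = 0; i < jobDescriptions.Count; i++)
{
    var json = JsonConvert.SerializeObject(jobDescriptions[i]);
    queues[i % queues.Length].AddMessage(new CloudQueueMessage(json));
}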

This did not help.

It still looks as if I have one single processor that has to run all WebJobs on whatever WebSites I create, no matter how independent they are. I have created an image of the CPU time as shown in the portal: CPU Time as shown in the portal

Can you please share the log files where it shows that the jobs are running? Did you configure the connection strings correctly for each job instance? Also, if you can share some code, that would be fantastic. – Victor Hurdugaci
The WebJobs are uploaded via FTP, and I generate the *.config file on the fly containing the connection string for the newly created storage account. The one time the connection strings were wrong, the WebJob would remain in a "Pending restart" loop. – Tragetaschen

1 Answer

0 votes

My thinking about Azure WebSites was simply wrong, and that's why I got confused:

In non-Free WebSites, there are two things that scale completely independently:

  • Computing power available for all those WebSites (a "ServerFarm" in the SDK). You select a machine size (Small to Large) and a number of those ("Instances"), and these are responsible for running all your Basic or Standard WebSites.
  • The software running on a URI, such as ASP.NET, PHP, or WebJobs

In my thinking, WebSites were directly linked to the virtual machine(s) backing them, but there is no such direct connection.

I now have a ServerFarm with n Large instances. In this ServerFarm, there are n WebSites, and each WebSite hosts 5 WebJobs so that the 4 processors in a Large instance are used thoroughly.

Now, everything scales as expected.
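
As an aside, newer versions of the WebJobs SDK can also process several queue messages concurrently within a single JobHost, which might be an alternative to deploying 5 separate WebJobs per site. A sketch, assuming the SDK version in use exposes the queue batch-size setting (verify against your version):

// Sketch: one continuous WebJob per site, handling up to 5 jobs concurrently.
// Queues.BatchSize is available in newer WebJobs SDK releases.
var config = new JobHostConfiguration();
config.Queues.BatchSize = 5;

var host = new JobHost(config);
host.RunAndBlock();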