We queue a lot of tasks in Azure Batch and have 8 nodes in our pool to process them. For the past two days we have been seeing strange behaviour on the nodes:

  • The node boots
  • It starts processing the tasks
  • About 30 seconds later it stops picking up new tasks
  • It will finish off the existing tasks and not pick up new ones

The node now remains idle even though we have 1000+ tasks queued waiting to be processed by the pool.

Rebooting the node puts it into an error state. It then starts up again, processes several tasks, and once more stops picking up new ones.

What I've checked:

  • I'm able to remote into these nodes
  • No errors in event logs indicating issues
  • No major spikes in Disk, CPU, Memory
  • Scheduling is not disabled on the nodes

[Image: screenshot of the nodes in the pool]

For visual reference:

  • Red blocks will not pick up new tasks.
  • Blue blocks will finish what they're busy with.
  • Green blocks (2 nodes) continue to pick up tasks and process them successfully.

Is this a bug in Azure Batch scheduling? We haven't made any changes recently.

If not a bug, how can we get more info about what's happening with these nodes during scheduling?
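
As a starting point for gathering that information ourselves, here is a rough sketch of a check that flags nodes sitting idle with scheduling enabled while tasks are still queued. It is only an illustration under some assumptions: `Node` and `find_stalled_nodes` are hypothetical names made up for this example, and the attribute names (`id`, `state`, `scheduling_state`, `running_tasks_count`) are assumed to match what the azure-batch Python SDK returns from `compute_node.list(pool_id)`.

```python
from collections import namedtuple

# Hypothetical stand-in for a compute node record. With the azure-batch SDK the
# real objects would (assumption) come from batch_client.compute_node.list(pool_id).
Node = namedtuple("Node", "id state scheduling_state running_tasks_count")


def _text(value):
    """Return a lowercase string for either a plain string or an enum member."""
    return str(getattr(value, "value", value)).lower()


def find_stalled_nodes(nodes, queued_tasks):
    """Return ids of nodes that are idle, schedulable, and running nothing
    even though tasks are still waiting in the queue."""
    if queued_tasks <= 0:
        return []  # nothing is queued, so an idle node is expected
    return [
        node.id
        for node in nodes
        if _text(node.state) == "idle"
        and _text(node.scheduling_state) == "enabled"
        and not node.running_tasks_count
    ]


if __name__ == "__main__":
    sample = [
        Node("node-1", "idle", "enabled", 0),
        Node("node-2", "running", "enabled", 3),
        Node("node-3", "idle", "disabled", 0),
    ]
    print(find_stalled_nodes(sample, queued_tasks=1000))
```

Running this periodically and logging the result would at least show when each node stops taking work, which may be useful detail to attach to a support request.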

You will need to open a support request so the service team can take a look at what's going on. – fpark

1 Answer

I opened a support ticket with Microsoft. It turned out to be a scheduling bug, which has now been fixed, and everything is working again.