
I am following this tutorial https://docs.microsoft.com/en-us/azure/batch/tutorial-parallel-python on how to use Azure Batch API.

It is not clear from that article, however, what happens to the nodes in the pool after the batch run is complete.

The reason I care is that the nodes I will be using will need extensive setup before the batch begins. Is it possible to have the VMs retain that setup between runs so as to save the bandwidth and time required for setup?

Also, what if that setup requires a restart (after installing GPU drivers, for instance)? Would it be possible to do that before the cluster is used?


1 Answer


The line batch_client.pool.delete(_POOL_ID) in the tutorial deletes the pool, and with it all of its nodes. To keep the VMs between runs, don't delete the pool; just submit your next job to the same pool.
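For example, a follow-up run can simply point a new job at the existing pool. This is only a minimal sketch: the pool ID, job ID and task command line are placeholders, and batch_client is assumed to be an already-authenticated BatchServiceClient created as in the tutorial.

```python
import azure.batch.models as batchmodels

_POOL_ID = 'PythonTutorialPool'   # the pool you chose NOT to delete
_JOB_ID = 'SecondRunJob'          # hypothetical ID for the next run

# Reuse the existing pool by referencing its ID in the new job.
job = batchmodels.JobAddParameter(
    id=_JOB_ID,
    pool_info=batchmodels.PoolInformation(pool_id=_POOL_ID))
batch_client.job.add(job)

# Tasks are added to the new job as usual; the nodes (and whatever
# your setup installed on them) are reused as-is.
batch_client.task.add(
    job_id=_JOB_ID,
    task=batchmodels.TaskAddParameter(
        id='task-0',
        command_line='/bin/bash -c "echo reusing warm node"'))
```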

Regarding the extensive setup, including GPU drivers and a reboot: I assume you are doing this via the pool's start task. Having a reboot in the start task (assuming it is the last command) should be OK, although I haven't tried this; the node should then be ready after the reboot.
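If you go that route, the start task might look roughly like this when the pool is created. This is a sketch under assumptions: the driver-install script, VM size and image are placeholders, and I haven't verified the reboot behaviour myself.

```python
import azure.batch.models as batchmodels

start_task = batchmodels.StartTask(
    # Hypothetical setup script; the reboot is the last command.
    command_line='/bin/bash -c "./install_gpu_drivers.sh && shutdown -r now"',
    user_identity=batchmodels.UserIdentity(
        auto_user=batchmodels.AutoUserSpecification(
            scope=batchmodels.AutoUserScope.pool,
            elevation_level=batchmodels.ElevationLevel.admin)),
    wait_for_success=True,   # don't schedule tasks until the start task succeeds
    max_task_retry_count=1)

pool = batchmodels.PoolAddParameter(
    id='PythonTutorialPool',
    vm_size='STANDARD_NC6',  # placeholder GPU VM size
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher='canonical',
            offer='ubuntuserver',
            sku='18.04-lts',
            version='latest'),
        node_agent_sku_id='batch.node.ubuntu 18.04'),
    target_dedicated_nodes=2,
    start_task=start_task)
batch_client.pool.add(pool)
```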

Perhaps a better option is to use a custom VM image that already contains all the complex setup, or alternatively to run your tasks in a Docker container that bakes in that setup (CUDA and so on).
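As a sketch of the custom-image route: the image resource ID and node agent SKU below are placeholders, and the image would need to be prepared in advance with your drivers and dependencies installed.

```python
import azure.batch.models as batchmodels

# Point the pool at a pre-built managed image that already has the
# GPU drivers and other setup baked in, so no lengthy start task is needed.
custom_image_pool = batchmodels.PoolAddParameter(
    id='CustomImagePool',
    vm_size='STANDARD_NC6',  # placeholder GPU VM size
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            # Hypothetical resource ID of your prepared image.
            virtual_machine_image_id=(
                '/subscriptions/<subscription-id>/resourceGroups/<rg>'
                '/providers/Microsoft.Compute/images/<my-gpu-image>')),
        node_agent_sku_id='batch.node.ubuntu 18.04'),
    target_dedicated_nodes=2)
batch_client.pool.add(custom_image_pool)
```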