TL;DR: Is there any way to get SGE to round-robin between servers when scheduling jobs, instead of allocating all jobs to the same server whenever it can?
Details:
I have a large compute process that consists of many smaller jobs. I'm using SGE to distribute the work across multiple servers in a cluster.
The process requires a varying number of tasks at different points in time (technically, it is a DAG of jobs). Sometimes the number of parallel jobs is very large (~1 per CPU in the cluster), sometimes it is much smaller (~1 per server). The DAG is dynamic and not uniform, so it isn't easy to tell how many parallel jobs there are, or will be, at any given point.
The jobs use a lot of CPU but also do a non-trivial amount of I/O (especially at job startup and shutdown). They access a shared NFS server connected to all the compute servers. Each compute server has a narrower connection (10Gb/s), but the NFS server has several wide connections (40Gb/s) into the communication switch. I'm not sure what the bandwidth of the switch backbone is, but it is a monster, so it should be high.
For optimal performance, jobs should be scheduled across different servers when possible. That is, if I have 20 servers, each with 20 processors, submitting 20 jobs should run one job on each. Submitting 40 jobs should run 2 on each, etc. Submitting 400 jobs would saturate the whole cluster.
However, SGE is perversely intent on minimizing my I/O performance. Submitting 20 jobs schedules all of them on a single server, so they all fight for a single measly 10Gb/s network connection while 19 other machines, with a combined bandwidth of 190Gb/s, sit idle.
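To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes each server's 10Gb/s link is shared evenly among the jobs running on that server, and it ignores any limit on the NFS server or switch side; the function name is just for illustration.

```python
def per_job_bandwidth_gbps(jobs_on_server: int, link_gbps: float = 10.0) -> float:
    """Even share of one server's network link among the jobs running on it.

    Assumption: the server's own link is the only bottleneck.
    """
    return link_gbps / jobs_on_server


# What SGE does now: all 20 jobs packed onto one server.
packed = per_job_bandwidth_gbps(20)

# What I want: one job on each of the 20 servers.
spread = per_job_bandwidth_gbps(1)

print(f"packed: {packed} Gb/s per job, spread: {spread} Gb/s per job")
```

Under those assumptions, each job gets a twentieth of the network bandwidth it would have if the jobs were spread out.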
I can force SGE to execute each job on a different server in several ways (using resources, using special queues, using my parallel environment and specifying '-t 1-', etc.). However, this means I will only be able to run one job per server, period. When the DAG opens up and spawns many jobs, the jobs will stall waiting for a completely free server while 19 out of the 20 processors of each machine will stay idle.
What I need is a way to tell SGE to assign each job to the next server that has an available slot, in round-robin order. A better way would be to assign the job to the least loaded server (maximal number of unused slots, or maximal fraction of unused slots, or minimal number of used slots, etc.). But a dead simple round-robin would do the trick.
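To be concrete about the behavior I'm after, here is a minimal Python sketch of the "minimal number of used slots" variant. This is not SGE configuration, just a simulation of the placement policy; `place_jobs` and its parameters are made-up names, and it assumes identical servers and that no job finishes while the others are being placed.

```python
def place_jobs(num_jobs: int, num_servers: int = 20, slots_per_server: int = 20):
    """Assign each job to the server with the fewest used slots.

    Returns (used, placement): used[s] is the number of slots taken on
    server s, and placement[j] is the server index chosen for job j.
    """
    used = [0] * num_servers
    placement = []
    for _ in range(num_jobs):
        # Least-loaded server; ties go to the lowest server index.
        server = min(range(num_servers), key=lambda s: used[s])
        if used[server] >= slots_per_server:
            break  # every slot is taken; remaining jobs would wait in the queue
        used[server] += 1
        placement.append(server)
    return used, placement


used_20, order_20 = place_jobs(20)    # one job per server
used_40, _ = place_jobs(40)           # two jobs per server
used_400, _ = place_jobs(400)         # whole cluster saturated
print(used_20, used_40, used_400, sep="\n")
```

With identical servers and nothing finishing in between, this visits the servers in the same order as a plain round-robin, which is why either policy would work for me.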
This seems like a much more sensible strategy in general, compared to SGE's policy of running each job on the same server as the previous job, which is just about the worst possible strategy for my case.
I looked over SGE's configuration options but I couldn't find any way to modify the scheduling strategy. That said, SGE's documentation isn't exactly easy to navigate, so I could have easily missed something.
Does anyone know of any way to get SGE to change its scheduling strategy to round-robin or least-loaded or anything along these lines?
Thanks!