
I want to parallelize an R script on a HPC with a Slurm scheduler.

SLURM is configured with SelectType: CR_Core_Memory.

Each compute node has 16 cores (32 threads).

I pass the R script to SLURM with the following configuration, using clustermq as the interface to Slurm (a sketch of the R-side setup follows the template).

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=normal
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 2048 }}
#SBATCH --cpus-per-task={{ n_cpus }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --ntasks={{ n_tasks }}
#SBATCH --nodes={{ n_nodes }}

#ulimit -v $(( 1024 * {{ memory | 4096 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
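
For context, the R side is set up roughly like this (the template path, the fill-in values, and the work function are simplified placeholders, not my exact code):

library(clustermq)

# point clustermq at Slurm and at the template above
options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "~/slurm_clustermq.tmpl"   # placeholder path
)

fx <- function(x) sum(rnorm(1e6)) + x              # placeholder workload

# submits an array job; the template list fills the {{ ... }} slots
# that have no defaults in the template
res <- Q(fx, x = 1:10,
         n_jobs = 1,
         template = list(n_cpus = 30, n_tasks = 1, n_nodes = 1))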

Within the R script I do "multicore" parallelization with 30 cores. I would like to use cores from multiple nodes to satisfy the requirement of 30 CPUs, e.g. 16 cores from node1 and 14 from node2.
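
Concretely, the multicore part inside the worker has roughly this shape (the work function is a stand-in):

library(parallel)

heavy_task <- function(i) {   # stand-in for the real per-item work
  sum(rnorm(1e6)) + i
}

# fork-based multicore parallelism: all 30 workers must sit on ONE node
res <- mclapply(1:120, heavy_task, mc.cores = 30)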

I tried using n_tasks = 2 and cpus-per-task=16. With this, the job gets assigned to two nodes, but only one node does any computation (on 16 cores); the second node is assigned to the job but does nothing.

In this question, srun is used to split parallelism across nodes with foreach and Slurm IDs. I use neither srun nor foreach. Is there a way to achieve what I want with SBATCH and multicore parallelism?

(I know that I could use SelectType=CR_CPU_Memory and have 32 threads available per node. However, the question is how to use cores/threads from multiple nodes in general to be able to scale up parallelism).

Is it necessary to use SLURM? Does each node have an IP? If so, it is easy to work with the parallel package to parallelize across several computers. – lcgodoy
Each node has an IP, but it's only internal, not public. Only the master node has a public IP. And yes, using Slurm is required here. – pat-s
I may not understand what you're asking, but it sounds like you're trying to parallelize a multicore (shared-memory) parallel job across nodes... you can't do that with a typical SLURM setup. Those nodes do not share memory, so you can't combine CPUs from different nodes. – nsheff
@nsheff Are there other ways to achieve what I want? I can't believe this is not possible :) – pat-s
Maybe @damienfrancois has an idea? – pat-s

1 Answer


Summary from my comments:

The answer is that you cannot do this, because your task uses all of those CPUs from within a single R process. You're asking a single R process to parallelize a task across more CPUs than the physical machine has, and you cannot split a single R process across multiple nodes: those nodes do not share memory, so you can't combine CPUs from different nodes, at least not with a typical cluster architecture. It would be possible with a distributed operating system like DCOS.

In your case, what you need to do is split your job up outside of those R processes. Run 2 (or 3, or 4) separate R processes, each on its own node, and restrict each R process to at most the number of CPUs a single machine has.
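
With clustermq, one way to sketch this (the function and the fill-in values here are illustrative, not your exact setup) is to request several workers, one per node, and keep the inner multicore parallelism at or below a single node's core count:

library(clustermq)

per_item <- function(i, cores) {
  # inner multicore parallelism stays within one node / one worker
  unlist(parallel::mclapply(seq_len(cores),
                            function(k) sum(rnorm(1e5)) + i + k,
                            mc.cores = cores))
}

res <- Q(per_item, i = 1:4,
         const = list(cores = 16),                 # <= physical cores per node
         n_jobs = 2,                               # 2 workers -> 2 array tasks
         template = list(n_cpus = 16, n_tasks = 1, n_nodes = 1))

Each array task then lands on one node and uses at most 16 cores there; scaling up means adding more workers, not more cores per worker.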