4
votes

I am moving a program parallelized by OpenMP to Cluster. The cluster is using Lava 1.0 as scheduler and has 8 cores in each nodes. I used a MPI wrapper in the job script to do multi-host parallel.

Here is the job script:

#BSUB -q queue_name
#BSUB -x

#BSUB -R "span[ptile=1]"
#BSUB -n 1

#BSUB -J n1p1o8
##BSUB -o outfile.email
#BSUB -e err

export OMP_NUM_THREADS=8

date
/home/apps/bin/lava.openmpi.wrapper -bynode -x OMP_NUM_THREADS \
    ~/my_program ~/input.dat ~/output.out 
date

I did some experiments on ONE host exclusively. However, I don't know how to explain some of the results.

1.
-nOMP_NUM_THREADStime
1      4      21:12      
2      4      20:12      

Does it mean MPI doesn't do any parallel here? I thought in second case every MPI process would have 4 OMP threads so it should use 800% CPU usage which should be faster than first one.

Another results to prove it is that
-nOMP_NUM_THREADStime
2      2      31:42      
4      2      30:47      

They also have pretty close run times.

2.
In this case, if I want to parallel this program in this cluster with reasonable optimized speed by simple way, is it reasonable to put 1 MPI process (tell LFG that I use one core) in every host, set OMP_NUM_THREADS = 8, and then run it exclusively? Therefore MPI only works on cross-node jobs and OpenMP works on inner node jobs. (-n = # of host; ptile = 1; OMP_NUM_THREADS = Max cores in each host)

UPDATE: The program is compiled by gfortran -fopenmp without mpicc. MPI is only used to distribute the executable.

UPDATE Mar.3: Program memory usage monitor

Local environment: Mac 10.8 / 2.9 Ghz i7 /8GB Memory

No OpenMP

  • Real memory size: 8.4 MB
  • Virtual memory size: 2.37 GB
  • Shared Memory Size: 212 KB
  • Private Memory Size: 7.8 Mb
  • Virtual Private Memory: 63.2 MB

With OpenMP (4 threads)

  • Real memory size: 31.5 MB
  • Virtual memory size: 2.52 GB
  • Shared Memory Size: 212 KB
  • Private Memory Size: 27.1 Mb
  • Virtual Private Memory: 210.2 MB

Cluster hardware brief info

Each host contains dual quad chips which is 8 cores per node and 8GB memory. The hosts in this cluster are connected by infiniband.

1
Most people here are not psychics. You could start by describing your computational algorithm, its memory requirements and computational intensity (preferably in FLOPS/byte). Then tell us what hardware each cluster node has in terms of CPU type, # of sockets, memory bandwidth, and so on.Hristo Iliev
@HristoIliev Thank you for you reply! My program is using OpenMP to parallelize a loop in fortran. There are also some allocate/deallocate memory in the loop. Could you tell me how to get the memory requirements and computational intensity. I am asking cluster manager about the hardware information. Thanks again.Ling0k
How about editing your answer and providing some sample kernel loops? There is no need to paste the entire program - just as much code as to give the idea of how much information is being processed. My point is that if your application is memory-bound, then the available memory bandwidth is what limits the speed-up.Hristo Iliev
@HristoIliev The code is actually pretty complicated and long. It is pretty hard to select "important" part to parse here since I am not sure which parts could be "important". Instead I did a memory monitor in my local environment and updated the memory info in the clusters. Please let me know if you need more information. Thank you so much!Ling0k
So you are only running the same executable multiple times. How then you expect that it would run faster with more processes, when they are not cooperating and sharing their work? MPI is a communication library and you must explicitly put MPI calls inside your application and have different processes work on different parts of the problem in order to have shorter processing times. Simply launching your executable multiple times would not magically make it run faster.Hristo Iliev

1 Answers

6
votes

Taking into account the information that you have specified in the comments, your best option is to:

  • request exclusive node access with -x (you already do that);
  • request one slot per node with -n 1 (you already do that);
  • set OMP_NUM_THREADS to the number of cores per node (you already do that);
  • enable binding of OpenMP threads;
  • launch the executable directly.

Your job script should look like this:

#BSUB -q queue_name
#BSUB -x
#BSUB -n 1

#BSUB -J n1p1o8
##BSUB -o outfile.email
#BSUB -e err

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

date
~/my_program ~/input.dat ~/output.out
date

OMP_PROC_BIND is part of OpenMP 3.1 specification. If using compiler which adheres to an older version of the standard, you should use the vendor-specific setting, e.g. GOMP_CPU_AFFINITY for GCC and KMP_AFFINITY for Intel compilers. Binding threads to cores prevents the operating system from moving around threads between different processor cores, which speeds up the executing, especially on NUMA systems (e.g. machines with more than one CPU sockets and separate memory controller in each socket) where data locality is very important.

If you'd like to run many copies of your program over different input files, then submit array jobs. With LSF (and I guess with Lava too) this is done by changing the job script:

#BSUB -q queue_name
#BSUB -x
#BSUB -n 1

#BSUB -J n1p1o8[1-20]
##BSUB -o outfile.email
#BSUB -e err_%I

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

date
~/my_program ~/input_${LSF_JOBINDEX}.dat ~/output_${LSF_JOBINDEX}.out
date

This submits an array job of 20 subjobs (-J n1p1o8[1-20]). %I in -e is replaced by the job number so you'll get a separate err file from each job. The LSF_JOBINDEX environment variable is set to the current job index, i.e. it will be 1 in the first job, 2 in the second and so on.


My question about the memory usage of your program was not about how much memory does it consume. It was about how large is the typical dataset that is processed in a single OpenMP loop. If the dataset is not small enough to fit into the last-level cache of the CPU(s), then memory bandwidth comes into consideration. If your code does heavy local processing on each data item, then it might scale with the number of threads. If on the other side it does simple and light processing, then memory bus might get saturated even by a single thread, especially if the code is properly vectorised. Usually this is measured by the so-called operational intensity in FLOPS/byte. It gives the amount of data processing that happens before the next data element is fetched from memory. High operational intensity means that a lot of number crunching happens in the CPU and data is only seldom transferred to/from memory. Such programs scale almost linearly with the number of threads, no matter what the memory bandwidth is. On the other side, codes with very low operational intensity are memory-bound and they leave the CPU underutilised.

A program that is heavily memory-bound doesn't scale with the number threads but with the available memory bandwidth. For example, on a newer Intel or AMD system, each CPU socket has its own memory controller and memory data path. On such systems the memory bandwidth is a multiple of the bandwidth of a single socket, e.g. a system with two sockets delivers twice the memory bandwidth of a single-socket system. In this case you might see improvement in the code run time whenever both sockets are used, e.g. if you set OMP_NUM_THREADS to be equal to the total number of cores or if you set OMP_NUM_THREADS to be equal to 2 and tell the runtime to put both threads on different sockets (this is a plausible scenario when threads are executing vectorised code and a single thread is able to saturate the local memory bus).