1
votes

I am using a Fortran code to run a large-scale simulation on a supercomputer. I am able to run the code in serial, but I want to improve the turnaround time. I am looking into making it parallel, and I have found that I can use either auto-parallelization or MPI. The question I have is: which is more likely to improve the turnaround time?

I was able to use the Intel Fortran compiler with the flags -parallel -par-report to see which DO loops were made parallel. If I run the compiled code on 4 processors, would that actually work, or do I have to do something special?
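For context, a typical build-and-run sequence might look like the following (a sketch only; the file name `simulation.f90` is hypothetical, and flag spellings vary between ifort versions):

```shell
# -parallel enables auto-parallelization; -par-report prints which
# loops the compiler managed to parallelize
ifort -O2 -parallel -par-report simulation.f90 -o simulation

# auto-parallelized Intel binaries pick up the thread count at run time
export OMP_NUM_THREADS=4
./simulation
```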

In addition, do you know of any useful resources for learning MPI? I want to be able to use more processors so that I can simulate more physical time; that is my end goal.

2
If you were able to use an auto-parallelizer, you should have been able to measure the run time and thus the speedup. What speedup did you get? Obviously it isn't going to go faster than Nx where you have N processors, unless what you turned on was auto-vectorization rather than auto-parallelization. - Ira Baxter
I was able to use an auto-parallelizer, but the job is still sitting in the queue. I was able to use auto-vectorization on a different system with the pgf90 compiler, and it seems like I should be able to get results in 51 hours compared to 96 hours before. Thank you. - Hiren
What do you mean with "large scale simulation"? In my understanding this means >1000 threads which is only possible with MPI. - Stefan
Maybe "large scale simulation" is the wrong terminology. What I meant to say is that the simulation is very compute-intensive: it can be run on a single node, but it takes 96 hours to produce 0.25 seconds of simulated time, and I need to reach a steady-state value, which usually happens after 1 second of simulated time; that takes about 20 days. I am modeling white-cell deformation in a shear flow, which has many parameters, such as the bonding between the surface and the cell, and uses 10,000 elements to model the cell (it is a combination of CFD and FEA). The FFT and the Monte Carlo method take a long time. - Hiren

2 Answers

1
votes

More than likely, MPI is going to be faster than auto-parallelization. However, auto-parallelization would take about 0.5 seconds worth of work to get a speed-up of, say, 1.2 compared to Y hours (maybe even up to Q weeks) of trial-and-error debugging to get a speed-up of, say, 1.7.

If you're interested in self-learning MPI through a book, Gropp, Lusk, & Skjellum's Using MPI is probably a good start.
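To give a feel for what MPI code looks like, here is a minimal Fortran "hello world" (a sketch; it assumes an MPI library such as MPICH or Open MPI is installed, and is compiled with mpif90 and launched with, e.g., mpirun -np 4):

```fortran
! Minimal MPI program: each process (rank) reports its identity.
program hello_mpi
  use mpi
  implicit none
  integer :: ierr, rank, nprocs

  call MPI_Init(ierr)                                ! start the MPI runtime
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)     ! this process's id
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)   ! total process count

  print '(a,i0,a,i0)', 'Hello from rank ', rank, ' of ', nprocs

  call MPI_Finalize(ierr)                            ! shut down cleanly
end program hello_mpi
```

Every MPI program you write, no matter how large, has this same init/communicate/finalize skeleton.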

0
votes

The answer depends a bit on the nature of your hardware and your application/workload. Do you use a multi-node cluster (most typical) or a big shared-memory machine? Assuming you are a cluster user, you will have to use MPI or Fortran coarrays for (most likely) distributed-memory cross-node parallelism AND something for intra-node shared-memory parallelism (SMP).

Shared-memory parallelism can give you a speedup proportional to the number of cores on a node (up to 32x with Xeons), or even more with coprocessors. Distributed-memory parallelism can give you a speedup proportional to the number of nodes. Both types (or actually all 3 types) of parallelism have to be used these days to get reasonable performance. You may think of it as a hierarchy: 1. MPI or coarrays at the top, 2. something for shared-memory threading in the middle, and 3. vectorization at the innermost level.
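Levels 2 and 3 of that hierarchy can be sketched in a few lines of Fortran (assuming a compiler with OpenMP support, e.g. built with -qopenmp or -fopenmp):

```fortran
! Sketch: OpenMP threads split the outer loop (level 2: shared memory),
! while the unit-stride inner loop is left for the compiler to
! vectorize (level 3).
program smp_sketch
  implicit none
  integer, parameter :: n = 1000
  real :: a(n, n), b(n, n), c(n, n)
  integer :: i, j

  call random_number(a)
  call random_number(b)

  !$omp parallel do private(i)
  do j = 1, n          ! columns distributed across threads
    do i = 1, n        ! contiguous in memory -> vectorizable
      c(i, j) = a(i, j) + 2.0 * b(i, j)
    end do
  end do
  !$omp end parallel do

  print *, 'c(1,1) = ', c(1, 1)
end program smp_sketch
```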

Well, from your question, it sounds like you are talking mostly about the SMP multicore threading level. This is where -parallel auto-parallelization operates. Don't expect big magic from auto-parallelization. If you want better, more scalable parallelism, you have to try OpenMP or MPI for shared memory. I would recommend OpenMP in most cases; it is often easier to program and more performant. But it's up to you, and you really should think bigger, about all 3 levels of parallelism. If you plan to address all 3 levels, then probably the optimal combination (since you are a happy Intel Fortran user) is 1. MPI at the first level + 2. OpenMP at the SMP level + 3. auto-vectorization guided by the OpenMP 4.0 simd directive at the third level. I'm not an expert in coarrays, but they might be a good alternative to MPI at level 1.
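Putting all three levels together, a hybrid skeleton might look like this (a sketch under stated assumptions: the per-rank array and the sum-of-squares computation are hypothetical stand-ins for your solver, and a real code would also exchange boundary data between ranks):

```fortran
! Sketch of the 3-level hierarchy: MPI across nodes, OpenMP threads
! across cores, and the OpenMP 4.0 simd directive guiding vectorization.
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 100000
  real :: u(nlocal), local_sum, global_sum
  integer :: ierr, rank, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  u = real(rank + 1)         ! each rank owns its slice of the domain
  local_sum = 0.0

  ! Level 2 + 3: threads split the loop; simd asks for vectorization.
  !$omp parallel do simd reduction(+:local_sum)
  do i = 1, nlocal
    local_sum = local_sum + u(i) * u(i)
  end do

  ! Level 1: combine per-rank results across the whole cluster.
  call MPI_Allreduce(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum of squares = ', global_sum

  call MPI_Finalize(ierr)
end program hybrid_sketch
```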

My answer makes less sense if you are not dealing with classic cluster hardware.