
My story

I am quite a beginner in parallel programming (I have never done anything more than writing some basic multithreaded things) and I need to parallelize some multithreaded Java code in order to make it run faster. The multithreaded algorithm simply spawns threads and hands them to the operating system, which does the distribution of threads for me. The results of every thread are gathered by a collector that also handles synchronisation issues with semaphores etc. and computes the sum of the results of all the different threads. The multithreaded code kinda looks like this:

public static void main(String[] args) {
    int numberOfProcesses = Integer.parseInt(args[0]);
    ...
    Collector collector = new Collector(numberOfProcesses);
    while(iterator.hasNext()) {
        Object x = iterator.next();
        new OverwrittenThread(x, collector, otherParameters).start();
    }
    if(collector.isReady())
        System.out.println(collector.getResult());
}

My first idea to convert this to MPI was the basic way (I guess) of just splitting up the loop and giving every iteration of this loop to another processor, like this (with mpiJava):

public static void main(String[] args) {
    ...
    Object[] foo = new Object[number];
    int i = 0;
    while(iterator.hasNext())
        foo[i++] = iterator.next();
    ...
    int myRank = MPI.COMM_WORLD.Rank();
    int numberOfProcesses = MPI.COMM_WORLD.Size();
    // Round-robin: each rank handles every numberOfProcesses-th element.
    for(int i = myRank; i < numberOfElementsFromIterator; i += numberOfProcesses) {
        //Perform code from OverwrittenThread on foo[i]
    }
    MPI.COMM_WORLD.Reduce(..., MPI.SUM, ...);
}

The problems

This is, so far, the only way that I, as a newbie in MPI, could make things work. It is only an idea, because I have no clue yet how to tackle implementation problems like converting BigIntegers to MPI datatypes, etc. (but I would get that far, I guess).

The real problem, though, is that this approach leaves the distribution of work very unbalanced, because it doesn't take into account how much work a certain iteration takes. This can really cause trouble, as some iterations finish in less than a second while others might need several minutes.

My question

Is there a way to get an approach similar to the multithreaded version in an MPI implementation? At first I thought it would just be a lot of non-blocking point-to-point communication, but I don't see a way to make that work. I also considered using the scatter functionality, but I have too much trouble understanding how to use it correctly.

Could anybody help me clear this up, please?
(I do understand basic C etc)

Thanks in advance


1 Answer


The first thing you need to ask yourself when converting a multi-threaded program to a distributed program is:

What am I trying to accomplish by distributing the data across multiple cores/nodes/etc.?

One of the most common issues people face when getting started with MPI is thinking that they can take a program that works well in a small, shared-memory environment (i.e. multi-threading on a single node) and throw more CPUs at it to make it faster.

Sometimes that is true, but often it's not. The most important thing to remember about MPI is that, for the most part (unless you're getting into RMA, which is another advanced topic altogether), each MPI process has its own separate memory, distinct from all other MPI processes. This is very different from a multi-threaded environment, where all threads typically share memory. It means you add a new problem on top of the other complexities that come with parallel programming: now you have to consider how to make sure that the data you need to process is in the right place at the right time.
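
To make the separate-memory point concrete, here is a minimal sketch using the same mpiJava-style bindings as the question's code (the class name and values are just for illustration). Each process has its own copy of data, and rank 1 only sees rank 0's value after an explicit message:

import mpi.*;

public class SeparateMemory {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();

        int[] data = new int[1];          // every process has its own copy
        if (rank == 0) {
            data[0] = 42;                 // only rank 0's copy changes
            MPI.COMM_WORLD.Send(data, 0, 1, MPI.INT, 1, 0);
        } else if (rank == 1) {
            MPI.COMM_WORLD.Recv(data, 0, 1, MPI.INT, 0, 0);
            System.out.println("Rank 1 received " + data[0]); // prints 42
        }

        MPI.Finalize();
    }
}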

One common way to do this is to ensure that all of the data is already available to all of the other processes outside of MPI, for instance, through a shared filesystem. Then the processes can just figure out what work they should be doing, and get started with their data. Another way is for a single process, often rank 0, to send the important data to the appropriate ranks. There are obviously other ways that you've already discovered to optimize this process. MPI_SCATTER is a great example.
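
For example, here is a hedged sketch of the rank-0-hands-out-the-data idea with Scatter and a Reduce at the end, again assuming mpiJava-style bindings. The work items are plain longs here so they map directly onto MPI.LONG; BigIntegers would need to be serialized first (for instance as byte arrays, or sent via MPI.OBJECT), which is exactly the datatype-conversion issue mentioned in the question:

import mpi.*;

public class ScatterSum {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int itemsPerRank = 4;                       // assumes the work divides evenly
        long[] all = new long[itemsPerRank * size]; // only filled on rank 0
        if (rank == 0)
            for (int i = 0; i < all.length; i++) all[i] = i;

        long[] mine = new long[itemsPerRank];
        MPI.COMM_WORLD.Scatter(all, 0, itemsPerRank, MPI.LONG,
                               mine, 0, itemsPerRank, MPI.LONG, 0);

        long[] partial = { 0 };
        for (long x : mine) partial[0] += x;        // stand-in for the real work

        long[] total = new long[1];
        MPI.COMM_WORLD.Reduce(partial, 0, total, 0, 1, MPI.LONG, MPI.SUM, 0);
        if (rank == 0) System.out.println("sum = " + total[0]);

        MPI.Finalize();
    }
}

Note that this still has the static-distribution drawback described in the question: every rank gets the same number of items regardless of how long each one takes.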

Just remember that it's not necessarily true that MPI is faster than multi-threading, which is faster than single-threading. In fact, sometimes it can be the opposite. The cost of moving your data around via MPI calls can be quite high. Make sure that it's what you actually want to do before trying to rewrite all of your code with MPI.

The reason people use MPI isn't just to speed up their code by taking advantage of more processors (though sometimes that's true). Sometimes it's because the problem their application is trying to solve is too big to fit in the memory of a single node.


All that being said, if your problem really does map well to MPI, you can do what you want to do. Your application appears to be a master/worker kind of job, which is relatively simple to deal with. Just have your master send non-blocking messages to the workers with their work, and post a non-blocking MPI_ANY_SOURCE receive so it can be notified when work is done. Whenever it gets a message from a worker, it sends out more work to be done.
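
Here is a sketch of that master/worker pattern, again in the mpiJava style of the question. It is a simplified version of the suggestion above: for brevity the master uses a blocking receive on MPI.ANY_SOURCE instead of a non-blocking one, the workers report their own rank in the reply so no Status fields are needed, the "work" is just squaring a number, and -1 is used as a shutdown signal. Tags, message layout and the class name are all made up for illustration:

import mpi.*;

public class MasterWorker {
    static final int WORK_TAG = 1, RESULT_TAG = 2;

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        if (rank == 0) {
            long[] items = { 3, 5, 7, 11, 13, 17 };   // the iterator's elements
            long total = 0;
            int next = 0, outstanding = 0;

            // Hand one item to every worker to get things started.
            for (int w = 1; w < size && next < items.length; w++) {
                MPI.COMM_WORLD.Send(new long[]{ items[next++] }, 0, 1, MPI.LONG, w, WORK_TAG);
                outstanding++;
            }

            // Collect results; whichever worker answers first gets the next item.
            long[] reply = new long[2];               // { workerRank, result }
            while (outstanding > 0) {
                MPI.COMM_WORLD.Recv(reply, 0, 2, MPI.LONG, MPI.ANY_SOURCE, RESULT_TAG);
                total += reply[1];
                outstanding--;
                if (next < items.length) {
                    MPI.COMM_WORLD.Send(new long[]{ items[next++] }, 0, 1, MPI.LONG, (int) reply[0], WORK_TAG);
                    outstanding++;
                }
            }

            // Tell every worker to stop.
            for (int w = 1; w < size; w++)
                MPI.COMM_WORLD.Send(new long[]{ -1 }, 0, 1, MPI.LONG, w, WORK_TAG);
            System.out.println("sum = " + total);
        } else {
            long[] item = new long[1];
            while (true) {
                MPI.COMM_WORLD.Recv(item, 0, 1, MPI.LONG, 0, WORK_TAG);
                if (item[0] == -1) break;             // shutdown signal
                long result = item[0] * item[0];      // stand-in for the real work
                MPI.COMM_WORLD.Send(new long[]{ rank, result }, 0, 2, MPI.LONG, 0, RESULT_TAG);
            }
        }
        MPI.Finalize();
    }
}

Because a worker gets a new item as soon as it reports its previous result, slow iterations no longer hold up the rest of the distribution, which addresses the load-imbalance concern from the question.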