MPI: time for output increases when the number of processors increases

Question

I have a problem printing a sparse matrix in a C++/MPI program that I hope you can help me solve.

Problem: I need to print a sparse matrix as a list of triples (x, y, v_xy) to a .txt file in a program that has been parallelized with MPI. Since I am new to MPI, I decided not to use the parallel I/O routines provided by the library and instead let the master processor (0 in my case) print the output. However, the time for printing the matrix increases when I increase the number of processors:

1 processor: 11.7 s
2 processors: 26.4 s
4 processors: 25.4 s

I have already verified that the output is exactly the same in the three cases. Here is the relevant section of the code:

if (rank == 0)
{    
    sw.start();

    std::ofstream ofs_output(output_file);
    targets.print(ofs_output);
    ofs_output.close();

    sw.stop();
    time_output = sw.get_duration();
    std::cout << time_output << std::endl;
}

My stopwatch sw is measuring wall clock time using the gettimeofday function. The print method for the targets matrix is the following:

void sparse_matrix::print(std::ofstream &ofs)
{
    int temp_row;
    for (const_iterator iter_row = _matrix.begin(); iter_row != _matrix.end(); ++iter_row)
    {
        temp_row = (*iter_row).get_key();
        for (value_type::const_iterator iter_col = (*iter_row).get_value().begin();
        iter_col != (*iter_row).get_value().end(); ++iter_col)
        {
            ofs << temp_row << "," << (*iter_col).get_key() << "," << (*iter_col).get_value() << std::endl;
        }
    }
}

I do not understand what is causing the slow-down since only processor 0 does the output and this is the very last operation of the program: all the other processors are done while processor 0 prints the output. Do you have any idea?

Are these actually three different machines or are you testing that code by pretending to have three processors on one? — stefan
@stefan: I am using an i7 quad-core processor (Dell XPS 15). I forgot to mention that I am executing the code in an Oracle Linux VirtualBox VM to which I allocated 4 processors in the settings. I can't figure out why the print execution time depends on the number of processors, since only processor 0 executes the print instruction. — Pierpaolo Necchi
On your VirtualBox VM, switching to more than one processor creates overhead that slows down your system. The conditions of your measurement are hence not representative! By the way, gettimeofday() is not the best function to measure performance (see linux.die.net/man/2/gettimeofday under the heading "Notes"). — Christophe
@Christophe: Thank you for the useful reference. I will keep it in mind. Concerning the slow-down, is the overhead present only when I execute my code using mpirun -np k with k >= 2? In that case, do you think that allocating more RAM to the VM will improve output performance? I am currently using 4 GB on the VM out of the 16 GB available on my laptop. Thank you for your help. — Pierpaolo Necchi
Yes, this slow-down looks terrible. When you use more cores, each core is a little bit slower, but the overall throughput of the processor is higher. The problem with virtual processors is that each environment is more than just a thread running on a core. Have you tried running your MPI program without the VM? Normally, MPI should be able to take advantage of native cores without overhead. Look also here: stackoverflow.com/questions/5797615/mpi-cores-or-processors — Christophe
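
Following up on the timing remark in the comments, here is a minimal sketch of a stopwatch built on std::chrono::steady_clock, which is monotonic and therefore not affected by system clock adjustments. The class and its method names mirror how `sw` is used in the question, but they are assumptions, since the original stopwatch implementation is not shown.

```cpp
#include <chrono>

// Hypothetical stand-in for the question's stopwatch `sw`.
// The method names are assumed from how `sw` is called in the question.
class stopwatch
{
public:
    using clock_type = std::chrono::steady_clock; // monotonic clock

    void start() { _begin = clock_type::now(); }
    void stop()  { _end = clock_type::now(); }

    // Elapsed wall-clock time in seconds between start() and stop().
    double get_duration() const
    {
        return std::chrono::duration<double>(_end - _begin).count();
    }

private:
    clock_type::time_point _begin{};
    clock_type::time_point _end{};
};
```

It would be used exactly like the `sw` object in the question: call start() before opening the file, stop() after closing it, then read get_duration(). In an MPI program, MPI_Wtime() is another option for wall-clock timing.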

Pierpaolo Necchi · Accepted Answer · 2015-02-11T12:03:21

Well, I finally understood what was causing the problem. Running my MPI-parallelized program on a Linux virtual machine drastically increased the time needed to print a large amount of data to a .txt file as the number of cores increased. The problem is caused by the virtual machine, which does not behave correctly when using MPI. I tested the same program on a physical 8-core machine, and there the time for printing the output does not increase with the number of cores used.
