3
votes

I'm trying to profile some C code but one of the most intuitively costly functions isn't showing up in the GProf output.

int main() {
    initialise...
    haloSwap();

    for(...) {
        functions...

        propagate();

        functions...
    }
}

void propagate() {
    for (x)
        for (y)
            for (z)
                grid[xNew][yNew][zNew] = grid[x][y][z];

    haloSwap();
}

void haloSwap() {
    // Horizontal swap
    create buffers...
    MPI_Sendrecv(buffers);
    recreate grid from buffers...

    // Vertical swap
    create buffers...
    MPI_Sendrecv(buffers);
    recreate grid from buffers...
}

Hopefully that pseudo-code goes some way to explaining the set up. haloSwap() involves a lot of communication between threads and I feel it's an expensive part of the algorithm. It's called during initialisation and then repeatedly during the loop of the algorithm.

GProf shows only 1 call to haloSwap (during init), even though I know it's called 1000+ times from inside propagate().

propagate() is showing as the most expensive part of the code, but I'd like to know whether it's the xyz loop(s) or the MPI comminucation.

Does anyone know why the calls to haloSwap from propagate are seemingly ignored in both the number of calls and the time spent in the function?

haloSwap is defined within another .c file, which may be a factor?

If I move the call of haloSwap to the main loop after calling propagate (instead of inside it), GProf still only shows 1 call to it.

2
You are probably compiling with optimisations and propagate() ends up being inlined. Tell your compiler to not inline functions. For example, with GCC the option is -fno-inline.Hristo Iliev
add to @Hristo_Iliev comment. mpicc (from MPICH )wrapper uses the -O2 optimization which includes the -finline-small-functions flag . If there are functions that you want to be excluded from any optimizations use the gcc attributes on functions for example void haloSwap() __attribute__(optimize("-O0"))Angelos
If you're using MPI, then it's doing I/O, and gprof is well-known to be blind to that, because sampling is suspended during I/O. You could be spending 99% of the time in I/O, but gprof would not show it. The same is true of anything calling itself a "CPU profiler".Mike Dunlavey
@MikeDunlavey I suppose a poor man's way to get an idea of that is to check real elapsed time versus gprof reported CPU time?Kallum Burgin
@KallumBurgin: That would show the issue. Some people use this technique which works on real time, not just CPU time. It has less precision on timing, but more precision for locating problems.Mike Dunlavey

2 Answers

3
votes

Profiling with optimizations enabled is absolutely not a stupid mistake. It is instead the way one should profile -- otherwise you'll be wasting time and effort hand-optimizing routines the compiler could have fixed for you.

If you need a different profile report than what gprof can provide, the google performance tools (aka gperftools) might be useful to you. The reports will record time spent in an inline function. Here's an example report where you can see how the google tools report inline functions:

% ~/soft/gperftools/bin/pprof --text \
    bigio_mpi bigio-profiling/prof-ca-fill-inline1-0.out| head -10
Using local file bigio_mpi.
Using local file bigio-profiling/prof-ca-fill-inline1-0.out.
Total: 29680 samples
10837  36.5%  36.5%    24056  81.1% ADIOI_P2PContigReadAggregation
3534  11.9%  48.4%     3534  11.9% _init@3a858
2954  10.0%  58.4%     2954  10.0% unshuffle_sse2
2448   8.2%  66.6%     2448   8.2% __memcpy_ssse3
1742   5.9%  72.5%     2661   9.0% type_create_contiguous_x (inline)
1373   4.6%  77.1%     1373   4.6% shuffle_sse2
1176   4.0%  81.1%     6205  20.9% ADIOI_Type_create_hindexed_x
1104   3.7%  84.8%    19082  64.3% ADIOI_Get_eof_offset
831   2.8%  87.6%    28184  95.0% ADIOI_Flatten
2
votes

@Hristo_Iliev and @Angelos were correct, I was compiling with optimisations and my problem was solved by compiling with -O0. Stupid mistake, I'm sure I've made it before.