I would like to profile my parallel code (both MPI and OpenMP).
I found that Callgrind is very easy to use for serial code, and its output is easy to analyze with KCachegrind, since it shows the relative time spent in different functions.
What would it give me when running parallel code? Would it monitor only the master process, or would it sum over all processes?
Can it detect deadlocks or places where one process is waiting for another?
Is there a better tool for profiling parallel code?