Sampling cpu-cycles with perf record is useful for finding optimization candidates if core utilization is roughly constant. But for code with multiple phases that differ in parallelism, counting cpu-cycles emphasizes the heavily parallel phases and under-emphasizes the sequential or limited-parallelism phases that still impact wall-time. In short, naïve perf use may highlight the wrong limb of Amdahl's law.
So the question is: how can I get perf record / perf report to find optimization candidates for reducing wall-time? These could be anything from the hottest loop in consistently parallel code, through a moderately parallel bottleneck, to a long single-threaded phase.
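To make the skew concrete, here is a minimal sketch with made-up numbers. The three phases, their durations, and their core counts are hypothetical and not measured from any real program. It also assumes a constant clock frequency, so that the number of cpu-cycles samples a phase receives is proportional to wall-time multiplied by the number of busy cores.

```python
# Hypothetical workload: (phase name, wall-clock seconds, busy cores).
# These numbers are invented purely for illustration.
phases = [
    ("serial_setup", 6.0, 1),
    ("partial_parallel", 3.0, 4),
    ("fully_parallel", 1.0, 32),
]


def shares(phases):
    """Return {name: (wall-time share, cpu-cycles sample share)}.

    Assumes samples are proportional to core-seconds, i.e. every busy
    core runs at the same, constant frequency.
    """
    total_wall = sum(wall for _, wall, _ in phases)
    total_cpu = sum(wall * cores for _, wall, cores in phases)
    return {
        name: (wall / total_wall, wall * cores / total_cpu)
        for name, wall, cores in phases
    }


result = shares(phases)
for name, (wall_share, cycle_share) in result.items():
    print(f"{name:18s} wall {wall_share:6.1%}  cpu-cycles {cycle_share:6.1%}")
```

With these numbers the serial phase accounts for 60% of the wall-time but only 12% of the cpu-cycles samples, while the fully parallel phase accounts for 10% of the wall-time and 64% of the samples. A profile sorted by sample count would therefore point at the phase with the least wall-time to gain.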
Known workarounds that leave something to be desired:
- executing the workload on a single core so that wall-time ≅ cpu-cycles
- profiling individual components separately
meta: this is a perf-specific follow-up to a more general question

perf record -C 0 ./omp_program or perf report -C 0) - it will partially remove the wrong limb. Second idea: do a diff between the main thread and a worker thread (-C 1). Third idea: add signalling using trace events to your parallel library and try the --switch-on/--switch-off options of perf report. Could you add an example? - osgx