I am trying to read performance counters with nvprof while executing two kernels concurrently, using the following command:
nvprof --concurrent-kernels on --events fb_subp0_write_sectors ./myprogram
However, when I profile this way, kernel execution seems to be serialized. What I want to measure is how the kernels perform when they are actually running concurrently.
Is it possible at all to read performance counters while kernels are running concurrently? I do not necessarily need per-kernel counter values; aggregate data is perfectly fine.
I am running on a Kepler GPU with compute capability 3.5.
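
For reference, here is a minimal sketch of the kind of setup in question, assuming the two kernels are launched on separate non-default streams. The kernel `fill`, the buffer size, and the launch configuration are hypothetical placeholders and not the actual program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: writes to global memory with a grid-stride
// loop so that a write-sector counter has traffic to count.
__global__ void fill(float *out, int n, float value)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        out[i] = value;
    }
}

int main()
{
    const int n = 1 << 22;  // placeholder size
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Two non-default streams, so the kernels are allowed to overlap.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Small grids (placeholder values) leave room for both kernels
    // to be resident on the GPU at the same time.
    fill<<<4, 256, 0, s0>>>(a, n, 1.0f);
    fill<<<4, 256, 0, s1>>>(b, n, 2.0f);

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Built with something like `nvcc -arch=sm_35 -o myprogram myprogram.cu`, the two launches can overlap in a timeline trace, which is the behavior I would like to keep while the counter is being collected.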