I am interested in getting memory performance counters for concurrent CUDA kernels. I tried several nvprof options, such as --metrics all and --print-gpu-trace, but the output seems to indicate that the kernels are no longer concurrent: the per-kernel metrics look almost exactly the same as when each kernel runs alone, so I believe the kernels were serialized. How can I get memory performance counters, for example for the L2 cache, while the kernels actually run concurrently?
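For reference, the invocations described above look roughly like this (`./myapp` is a placeholder for the application being profiled):

```
nvprof --metrics all ./myapp        # collects metric counters
nvprof --print-gpu-trace ./myapp    # prints a per-kernel GPU trace
nvprof --query-metrics              # lists available metric names (e.g. the L2 ones)
```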
1 Answer
You cannot do per-kernel profiling while the kernels execute concurrently. You can, however, try the following workarounds:
- Do only tracing. If you don't specify `--metrics` or `--events`, nvprof will only do a tracing run. In this case, nvprof will run the kernels concurrently, but you will only get kernel timings, not metric/event data.
- If you own an NVIDIA Tesla GPU (as opposed to a GeForce or Quadro), you can use the CUPTI library's `cuptiSetEventCollectionMode(CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS)` API to sample the counters you want while the kernels are running concurrently. However, this only gives you aggregate metric/event data over the sampling interval, which means you will not be able to correlate the data with individual kernels. CUPTI ships with a code sample called `event_sampling` that demonstrates how to use this API (a sketch follows below).
- Profile the metrics/events you want and let the kernels serialize. For some metrics/events, you may be able to simply sum the values to estimate the behavior during concurrent execution.
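For illustration, here is a minimal sketch of the continuous-sampling approach, loosely modeled on the `event_sampling` sample that ships with CUPTI. The event name (`inst_executed`, the sample's default), the sampling interval, and the error-checking macro are placeholders; L2-related event names vary per GPU architecture and can be listed with `nvprof --query-events`. The real sample reads the counter from a separate thread while the kernels run; this sketch simply launches the kernels asynchronously and samples in the same thread.

```cpp
// Sketch of CUPTI continuous event sampling (not a drop-in replacement for
// the event_sampling sample). Link with -lcuda -lcupti.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cupti.h>
#include <cstdio>
#include <unistd.h>

#define CHECK_CUPTI(call)                                   \
  do {                                                      \
    CUptiResult _r = (call);                                 \
    if (_r != CUPTI_SUCCESS) {                               \
      const char *msg;                                       \
      cuptiGetResultString(_r, &msg);                        \
      fprintf(stderr, "CUPTI error: %s\n", msg);             \
      return 1;                                              \
    }                                                        \
  } while (0)

int main() {
  cudaFree(0);                  // force creation of a CUDA context
  CUcontext ctx;
  cuCtxGetCurrent(&ctx);
  CUdevice dev;
  cuDeviceGet(&dev, 0);

  // Continuous mode allows sampling counters outside kernel boundaries
  // (supported on Tesla-class GPUs).
  CHECK_CUPTI(cuptiSetEventCollectionMode(
      ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS));

  CUpti_EventGroup group;
  CHECK_CUPTI(cuptiEventGroupCreate(ctx, &group, 0));

  CUpti_EventID eventId;
  // "inst_executed" is the event used by the event_sampling sample;
  // substitute an L2-related event name valid for your GPU.
  CHECK_CUPTI(cuptiEventGetIdFromName(dev, "inst_executed", &eventId));
  CHECK_CUPTI(cuptiEventGroupAddEvent(group, eventId));
  CHECK_CUPTI(cuptiEventGroupEnable(group));

  // ... launch your concurrent kernels asynchronously here ...

  // Sample the aggregate counter periodically while the kernels run.
  for (int i = 0; i < 10; ++i) {
    uint64_t value = 0;
    size_t bytes = sizeof(value);
    CHECK_CUPTI(cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                         eventId, &bytes, &value));
    printf("sample %d: %llu\n", i, (unsigned long long)value);
    usleep(50 * 1000);          // 50 ms sampling interval (arbitrary)
  }

  cudaDeviceSynchronize();
  CHECK_CUPTI(cuptiEventGroupDisable(group));
  CHECK_CUPTI(cuptiEventGroupDestroy(group));
  return 0;
}
```

Note that, as the answer says, the values read this way are aggregates over the sampling interval across everything running on the device; they cannot be attributed to an individual kernel.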