0
votes

I am interested in collecting memory performance counters for concurrent CUDA kernels. I tried several nvprof options, such as --metrics all and --print-gpu-trace, but the output indicates that the kernels are no longer concurrent: the metrics for each kernel look almost exactly the same as when each kernel runs alone, so I believe the kernels were run in sequence. How can I get memory performance counters, for example for the L2 cache, for kernels that actually run concurrently?

1
See here: "When you attempt to profile a metric or event with nvprof, all the concurrent kernels in the application are serialized." So it's a limitation of nvprof, currently. - Robert Crovella
@RobertCrovella Thank you Robert. Is there any way to get concurrent kernels' performance metrics? - palebluedot
I don't know of a way. - Robert Crovella
Also, this is mentioned in the documentation here under "Metrics and Events". That is admittedly a section on remote profiling, but the statement is true even in the ordinary case. - Robert Crovella

1 Answer

1
votes

You cannot do per-kernel profiling while the kernels execute concurrently. You can, however, try the following workarounds:

  1. Do only tracing. If you don't specify --metrics or --events, nvprof will only do a tracing run. In this case, nvprof will run the kernels concurrently, but you will only get kernel timings, not metric/event data.
  2. If you own an NVIDIA Tesla GPU (as opposed to GeForce or Quadro), you can use the CUPTI library's cuptiSetEventCollectionMode(CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS) API to sample the events you want while the kernels are running concurrently. However, this will only give you the aggregate metric/event data over each sampling interval, which means you will not be able to correlate this data with individual kernels. CUPTI ships with a code sample called event_sampling that demonstrates how to use this API.
  3. Profile the metrics/events you want, and let the kernels serialize. For some metrics/events, you may be able to simply sum up the values to estimate behavior during concurrent execution.
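As a rough sketch of workarounds 1 and 3, the two nvprof invocations might look like the following. The executable name ./my_app is a placeholder, and the specific L2 metric names are assumptions that vary by GPU architecture (check nvprof --query-metrics on your device):

```shell
# Workaround 1: tracing-only run. No --metrics/--events, so the kernels
# stay concurrent; you get per-kernel timestamps but no counter data.
nvprof --print-gpu-trace ./my_app

# Workaround 3: metrics run. nvprof serializes the kernels, but you get
# per-kernel L2 counters (metric names here are architecture-dependent).
nvprof --metrics l2_read_transactions,l2_write_transactions ./my_app
```

These commands require an NVIDIA GPU and the CUDA toolkit, so treat them as an outline rather than a copy-paste recipe.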
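For workaround 2, the continuous-mode sampling demonstrated by CUPTI's event_sampling sample looks roughly like the sketch below. This is an untested outline, not the actual sample code: the event name l2_subp0_read_sector_misses is an assumption (valid event names vary by GPU), error checking is omitted, and in a real program the kernels would be launched concurrently from another thread or stream while this loop samples.

```c
#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <cupti_events.h>

/* Sketch: continuously sample one event while kernels run concurrently.
 * The values read are aggregates over the sampling interval and cannot
 * be attributed to individual kernels. */
void sample_event(CUcontext ctx, CUdevice dev)
{
    /* Switch from the default kernel-scoped collection mode to continuous
     * collection, so reads can happen while kernels are in flight. */
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);

    CUpti_EventGroup group;
    cuptiEventGroupCreate(ctx, &group, 0);

    CUpti_EventID eventId;
    /* Assumed event name; list valid names for your device with
     * "nvprof --query-events" or the cuptiEventDomain* APIs. */
    cuptiEventGetIdFromName(dev, "l2_subp0_read_sector_misses", &eventId);
    cuptiEventGroupAddEvent(group, eventId);
    cuptiEventGroupEnable(group);

    /* Periodically read the counter while the concurrent kernels run. */
    for (int i = 0; i < 10; ++i) {
        uint64_t value = 0;
        size_t size = sizeof(value);
        cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                                 eventId, &size, &value);
        printf("sample %d: %llu\n", i, (unsigned long long)value);
        /* Sleep or yield between samples in real code. */
    }

    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
}
```

The event_sampling sample that ships with CUPTI is the authoritative version of this pattern; the sketch above only shows the overall call sequence.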