
I would like to profile the training loop of a transformer model written in TensorFlow on a multi-GPU system. Since the code doesn't support TF2, I cannot use the built-in (but experimental) profiler. Therefore, I would like to use nvprof + nvvp (CUDA 10.1, driver: 418).
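For reference, the invocation looks roughly like this (train.py and the output file name are placeholders for my actual script and path); the exported file is then opened in nvvp:

    # Export a timeline that can later be imported into nvvp
    nvprof --export-profile timeline.nvvp python train.py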

I can profile the code without any errors; however, when examining the results in nvvp, there is no data for the GPUs. I don't know what causes this, as nvidia-smi clearly shows that the GPUs are being utilized.

This thread seems to describe the same issue, but there is no solution. Following the suggestions in this question, I ran cuda-memcheck on the code, which yielded no errors.

I have tried running nvprof with additional command-line arguments, such as --analysis-metrics (no difference) and --profile-child-processes (which warns that it cannot capture GPU data), to no avail; the variants I tried are sketched below.
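Roughly, those attempts looked like this (again, train.py is a placeholder; with --profile-child-processes the output file name needs a %p, which nvprof expands to the process ID):

    # Collect the metrics used by nvvp's guided analysis
    nvprof --analysis-metrics --export-profile analysis.nvvp python train.py

    # Profile each child process into its own file
    nvprof --profile-child-processes --export-profile timeline_%p.nvvp python train.py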

Could someone please help me understand why I cannot capture GPU data and how I can fix this?

Also, why are there so few resources on profiling deep neural networks? Given the long training times, it seems especially important to make full use of all available computing resources.

Thank you!


1 Answer


Consider adding the command-line argument --unified-memory-profiling off. On some systems, unified-memory profiling interferes with the data nvprof records, and disabling it lets the GPU timeline appear in nvvp.
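A minimal sketch of the full command with that option (train.py is a placeholder for your training script):

    # Disable unified-memory profiling and export a timeline for nvvp
    nvprof --unified-memory-profiling off --export-profile timeline.nvvp python train.py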