
I've prepared a .pro file to use Qt and CUDA on a 64-bit Linux machine. When I run the application in the CUDA profiler, the app executes 12 times, but before the results are presented I get the following error:

Error in profiler data file '/home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv' at line number 6 for column 'memory transfer size.

The main.cpp file is as simple as

#include <QtCore/QCoreApplication>

extern "C"
void runCudaPart();

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    runCudaPart();  // kernel launch and memory transfers
    return 0;
}

The fact is that if I remove the "QCoreApplication a(argc, argv);" line, the CUDA Visual Profiler works as expected and shows all the results.

I've checked that cuda_profile.log is generated from the command line if I export the CUDA_PROFILE=1 environment variable. The comma-separated file is also generated if I export the COMPUTE_PROFILE_CSV=1 variable, but the CUDA Visual Profiler crashes when I try to import that file.

Any hints about this issue? It seems to be related to the CUDA Visual Profiler application, not to the code.

If you are wondering why I wrote such a simple main.cpp with Qt but without actually using Qt :P, it is because I would like to extend the framework in the future to add a GUI.

// details of CUDA, GPU, OS, QT, and compiler versions

  Device: "GeForce GTX 480"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    2.0
  OS: ubuntu 10.04 LTS
  QT_VERSION: 263682 (0x040602, i.e. Qt 4.6.2)
  gcc version 4.4.3
  nvcc compilation tool, release 3.2, V0.2.122

I've noticed that the problem is with the QCoreApplication constructor. It does something with the arguments. If I change the line to:

QCoreApplication a();

the Visual Profiler works as expected. It is hard to know what is happening and whether this change will be a problem in the future. Any hints?
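A note on why that change may appear to help: in C++, a statement of the form `QCoreApplication a();` is parsed as the declaration of a function named `a` that returns a QCoreApplication, not as an object definition, so no QCoreApplication is constructed at all. The minimal sketch below illustrates this with a hypothetical stand-in class `Probe` (it is not a Qt type, and the helper function names are made up for illustration). Like QCoreApplication, `Probe` has no default constructor, yet the empty-parentheses line still compiles, because it never creates an object.

```cpp
#include <cassert>

// Hypothetical stand-in for QCoreApplication: counts how often it is constructed.
struct Probe {
    static int constructed;
    Probe(int &argc, char **argv) { (void)argc; (void)argv; ++constructed; }
};
int Probe::constructed = 0;

// Number of Probe objects created by the empty-parentheses form.
int constructedByEmptyParens() {
    int before = Probe::constructed;
    Probe a();  // declares a function `a` returning Probe; no object is created
    return Probe::constructed - before;
}

// Number of Probe objects created when the arguments are passed.
int constructedWithArguments(int argc, char **argv) {
    int before = Probe::constructed;
    Probe a(argc, argv);  // a real object: the constructor runs once
    (void)a;
    return Probe::constructed - before;
}

// Runs both variants and checks the counts.
void demonstrate(int argc, char **argv) {
    assert(constructedByEmptyParens() == 0);
    assert(constructedWithArguments(argc, argv) == 1);
}
```

If this is what happens here, the modified program behaves like the version with the QCoreApplication line removed, which would be consistent with the profiler working in both cases. Some compilers warn about this form (for example with a "vexing parse" warning).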

Regarding the QCoreApplication constructor, the example also works if I call the CUDA part before constructing the QCoreApplication.

// this way the example works.
runCudaPart();
QCoreApplication a(argc, argv);

Thanks in advance.

The application invokes a kernel launch and memory transfers. The application releases all resources. The application exits normally (return cudaThreadExit()). Any hints? Thanks! - pQB
Did you check the csv file? Is it correct or corrupted? You could try opening it in Excel or something similar. - Bart
@Bart The csv file is fine. I can open it with OpenOffice Calc or any text editor. Thanks! - pQB
It's worth noting that the CUDA Visual Profiler is built on Qt. Would it be possible to try with CUDA 4.0 to verify whether this still occurs? If so, I would suggest filing an NVIDIA bug report (you'll need to join the CUDA registered developer program). - harrism
I have sent the CUDA Visual Profiler team a link to this issue so they can investigate. - harrism

2 Answers


I can't reproduce this with CUDA 3.2 and Qt 4 on a 64-bit Ubuntu 10.04 LTS system. I took this main:

#include <QtCore/QCoreApplication>

extern float cudamain();

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    float gflops = cudamain();

    return 0;
}

and a cudamain() containing this:

#include <assert.h>
#include <stdlib.h>

#define blocksize 16
#define HM (4096)
#define WM (4096)
#define WN (4096)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

// Tiled matrix multiply P = M * N using shared-memory tiles.
__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;
    int rowM=ty+by*blocksize;
    int colN=tx+bx*blocksize;
    float Pvalue=0;

    for(int m=0; m<uWM; m+=blocksize){
        MS[ty][tx]=M[rowM*uWM+(m+tx)];
        NS[ty][tx]=N[colN + uWN*(m+ty)];
        __syncthreads();
        for(int k=0;k<blocksize;k++)
            Pvalue+=MS[ty][k]*NS[k][tx];
        __syncthreads();
    }
    P[rowM*uWN+colN]=Pvalue;
}

inline void gpuerrorchk(cudaError_t state)
{
    assert(state == cudaSuccess);
}

float cudamain(){

    cudaEvent_t evstart, evstop;
    gpuerrorchk( cudaEventCreate(&evstart) );
    gpuerrorchk( cudaEventCreate(&evstop) );

    // host buffers
    float *M=(float*)malloc(HM*WM*sizeof(float));
    float *N=(float*)malloc(HN*WN*sizeof(float));
    float *P=(float*)malloc(HP*WP*sizeof(float));

    // the fill values are arbitrary
    for(int i=0;i<WM*HM;i++)
        M[i]=1.0f;
    for(int i=0;i<WN*HN;i++)
        N[i]=1.0f;

    float *Md,*Nd,*Pd;
    gpuerrorchk( cudaMalloc((void**)&Md,HM*WM*sizeof(float)) );
    gpuerrorchk( cudaMalloc((void**)&Nd,HN*WN*sizeof(float)) );
    gpuerrorchk( cudaMalloc((void**)&Pd,HP*WP*sizeof(float)) );

    gpuerrorchk( cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice) );
    gpuerrorchk( cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice) );

    dim3 dimBlock(blocksize,blocksize);//(tile_width , tile_width);
    dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y);//(width/tile_width , width/tile_width);

    gpuerrorchk( cudaEventRecord(evstart,0) );

    nonsquare<<<dimGrid,dimBlock>>>(Md,Nd,Pd,WM, WN);
    gpuerrorchk( cudaPeekAtLastError() );

    gpuerrorchk( cudaEventRecord(evstop,0) );
    gpuerrorchk( cudaEventSynchronize(evstop) );
    float time;
    gpuerrorchk( cudaEventElapsedTime(&time,evstart,evstop) );

    gpuerrorchk( cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost) );

    // release device and host resources
    gpuerrorchk( cudaFree(Md) );
    gpuerrorchk( cudaFree(Nd) );
    gpuerrorchk( cudaFree(Pd) );
    free(M); free(N); free(P);
    gpuerrorchk( cudaEventDestroy(evstart) );
    gpuerrorchk( cudaEventDestroy(evstop) );

    float gflops=(2.e-6*WM*WM*WM)/(time);

    return gflops;
}

(pay no attention to the actual code other than it doing memory transactions and running a kernel, it is nonsense otherwise).

Compiling the code like this:

cuda:~$ nvcc -arch=sm_20 -c -o cudamain.o cudamain.cu 
cuda:~$ g++ -o qtprob -I/usr/include/qt4 qtprob.cc cudamain.o -L $CUDA_INSTALL_PATH/lib64 -lQtCore -lcuda -lcudart
cuda:~$ ldd qtprob
        linux-vdso.so.1 =>  (0x00007fff242c8000)
        libQtCore.so.4 => /opt/cuda-3.2/computeprof/bin/libQtCore.so.4 (0x00007fbe62344000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007fbe61a3d000)
        libcudart.so.3 => /opt/cuda-3.2/lib64/libcudart.so.3 (0x00007fbe617ef000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fbe614db000)
        libm.so.6 => /lib/libm.so.6 (0x00007fbe61258000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fbe61040000)
        libc.so.6 => /lib/libc.so.6 (0x00007fbe60cbd000)
        libz.so.1 => /lib/libz.so.1 (0x00007fbe60aa6000)
        libgthread-2.0.so.0 => /usr/lib/libgthread-2.0.so.0 (0x00007fbe608a0000)
        libglib-2.0.so.0 => /lib/libglib-2.0.so.0 (0x00007fbe605c2000)
        librt.so.1 => /lib/librt.so.1 (0x00007fbe603ba000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fbe6019c000)
        libdl.so.2 => /lib/libdl.so.2 (0x00007fbe5ff98000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbe626c0000)
        libpcre.so.3 => /lib/libpcre.so.3 (0x00007fbe5fd69000)

produces an executable which profiles without error as many times as I care to run it with the CUDA 3.2 release profiler.

All I can suggest is that you try my repro case and see whether it works or not. If it fails, then perhaps you have either a broken CUDA or Qt installation. If it doesn't fail (and I suspect it won't), then you have a problem either with the way you are building the Qt project or with the actual CUDA code you are running.


@pQB Hello, I am Ramesh from NVIDIA. We could not reproduce this issue locally here. This kind of error comes when the value for that column is either empty or invalid. In your case (Error in profiler data file '/home/myusername/development/qtspace/bin/temp_compute_profiler_0_0.csv' at line number 6 for column 'memory transfer size) the value for column ‘memory transfer size’ is either empty or invalid for line no. 6 in the csv file.

Can you send ‘temp_compute_profiler_0_0.csv’, if it is present in your working directory, along with the csv generated by the command line profiler? If that is not possible, check what value you are getting for that column (memory transfer size) in line no. 6.
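The check described above can also be done programmatically. The following is only a rough sketch under assumptions that have not been verified against real profiler output: lines starting with '#' are treated as comments, the first remaining line is treated as the header, cells are split on plain commas without quoting, and the column name is passed in by the caller because the exact header text in the file is not known here. The function names are hypothetical. It reports the line numbers whose value in the chosen column is empty or not a number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Splits one line on commas (no quoting support; assumed sufficient here).
std::vector<std::string> splitCsvLine(const std::string &line) {
    std::vector<std::string> cells;
    std::string cell;
    std::istringstream in(line);
    while (std::getline(in, cell, ',')) cells.push_back(cell);
    // getline drops a trailing empty cell ("a,b," yields a,b); restore it.
    if (!line.empty() && line[line.size() - 1] == ',') cells.push_back("");
    return cells;
}

// True if the whole cell parses as a number.
bool isNumber(const std::string &cell) {
    if (cell.empty()) return false;
    char *end = 0;
    std::strtod(cell.c_str(), &end);
    return end != cell.c_str() && *end == '\0';
}

// Returns the 1-based line numbers whose value in `column` is empty or
// non-numeric. Throws if the header does not contain `column`.
std::vector<int> badLines(std::istream &in, const std::string &column) {
    std::vector<int> bad;
    std::string line;
    int lineNo = 0;
    int col = -1;             // index of `column`, set once the header is seen
    bool haveHeader = false;
    while (std::getline(in, line)) {
        ++lineNo;
        if (line.empty() || line[0] == '#') continue;  // assumed comment syntax
        std::vector<std::string> cells = splitCsvLine(line);
        if (!haveHeader) {
            haveHeader = true;
            for (std::size_t i = 0; i < cells.size(); ++i)
                if (cells[i] == column) col = static_cast<int>(i);
            if (col < 0) throw std::runtime_error("column not found: " + column);
            continue;
        }
        if (col >= static_cast<int>(cells.size()) || !isNumber(cells[col]))
            bad.push_back(lineNo);
    }
    return bad;
}
```

To use it, open the csv with a `std::ifstream` and call `badLines(file, name)` with the header text of the memory transfer size column as it appears in the file; if line 6 is in the result, that would match the error message.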

Are you running your app with the default settings in Visual Profiler? Can you try running your app with the 'memory transfer size' option disabled? To disable this option, click the menu “Session->Session Settings…”, then in the session settings dialog click the “Other Options” tab and uncheck “memory transfer size”.