kernel timed out for large array when X Server is on

Question

I am launching my kernel and checking for possible errors as follows:

kernel<<<grid,block>>>(d_Basis, d_repul_aux, nao);
cout<<"done with the ERIs...."<<endl;
std::string error = cudaGetErrorString(cudaPeekAtLastError());
cout<<error<<endl;

HANDLE_ERROR(cudaMemcpy(eris_gpu_cpu_aux.data(),d_repulsion_aux,eris_size*sizeof(double),cudaMemcpyDeviceToHost));

where cudaGetErrorString(cudaPeekAtLastError()) is used to check the kernel for errors, and I have defined:

static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
  if (err != cudaSuccess) {
    printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
            file, line );
    exit( EXIT_FAILURE );
  }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

When the X server is off, the computation runs as expected; but if I turn on the X server, the kernel hangs and I get the following output:

done with the ERIs....
no error
the launch timed out and was terminated in main.cu at line 1038

Line 1038 of the source code is:

HANDLE_ERROR(cudaMemcpy(eris_gpu_cpu_aux.data(),d_repulsion_aux,eris_size*sizeof(double),cudaMemcpyDeviceToHost));

This suggests that the computation crashes while the results are being copied from the device to the host. I am using a GeForce GTX 480 graphics card and CUDA 7.5.

Attempting to solve the problem, I tried to turn off the "Interactive" option in the /etc/X11/xorg.conf file, but the X server does not recognize this option. What can I do to share the GPU between the X server and my GPGPU application? I ask because it is uncomfortable for me to write and debug my code in a text-mode environment.

The error in the cudaMemcpy following the kernel is actually a failure of the kernel to complete successfully. You should follow the instructions here. Your statements about the X server not recognizing that option don't make sense. That option is handled by the NVIDIA GPU driver, not the X server. So my guess would be that you did not modify the xorg.conf correctly, or else you modified an xorg.conf file that your X server is not even using for display configuration. - Robert Crovella
Thanks a lot for your advice. In fact, I used the wrong syntax to turn off the Interactive option; an example of the right syntax can be found here. - user3116936
If you want to provide an answer showing a snippet of the changes you made to your xorg.conf, it would probably be useful for other readers. I would upvote. - Robert Crovella
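
The point made in the first comment can be made visible in code. A kernel launch is asynchronous, so a check placed right after the launch only reports launch-time errors; a failure during execution, such as the watchdog timeout, surfaces at the next blocking call, which in the question is the cudaMemcpy. The following is a minimal sketch of checking the two stages separately. The kernel fill, the helper cudaOk, and the macro CUDA_OK are hypothetical names used for illustration and are not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on success, otherwise reports which
// stage failed together with the CUDA error string.
static bool cudaOk(cudaError_t err, const char *stage,
                   const char *file, int line) {
  if (err == cudaSuccess) return true;
  std::fprintf(stderr, "%s failed: %s in %s at line %d\n",
               stage, cudaGetErrorString(err), file, line);
  return false;
}
#define CUDA_OK(call, stage) cudaOk((call), (stage), __FILE__, __LINE__)

// Placeholder kernel standing in for the real computation.
__global__ void fill(double *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0 * i;
}

int main() {
  const int n = 1 << 20;
  double *d_out = nullptr;
  if (!CUDA_OK(cudaMalloc(&d_out, n * sizeof(double)), "cudaMalloc")) return 1;

  fill<<<(n + 255) / 256, 256>>>(d_out, n);

  // Launch-time problems (for example an invalid configuration) appear here.
  bool ok = CUDA_OK(cudaGetLastError(), "kernel launch");

  // Execution-time problems (for example a watchdog timeout) appear only
  // once the host waits for the kernel to finish.
  ok = ok && CUDA_OK(cudaDeviceSynchronize(), "kernel execution");

  cudaFree(d_out);
  return ok ? 0 : 1;
}
```

With a check like this, a timeout would be attributed to the "kernel execution" stage instead of to the memory copy that happens to follow it.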

user3116936 · Accepted Answer · 2016-01-07T20:00:35

My previous /etc/X11/xorg.conf file was as follows:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 319.21  (buildmeister@swio-display-x86-rhel47-14)  Sun May 12 00:46:48 PDT 2013


Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

To solve the problem, we have to disable the watchdog timeout by adding Option "Interactive" "false" to the "Device" section, as follows:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 319.21  (buildmeister@swio-display-x86-rhel47-14)  Sun May 12 00:46:48 PDT 2013


Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    ##
    ##  disable watchdog timeouts for long-running CUDA kernels
    ##
    Option         "Interactive" "false"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
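
After editing xorg.conf and restarting the X server, it may help to confirm that the driver actually picked up the option. One possible check, sketched below, reads the kernelExecTimeoutEnabled field of cudaDeviceProp, which reports whether a run-time limit applies to kernels on each device. The helper watchdogLabel is a hypothetical name introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turns the property flag into a readable label.
static const char *watchdogLabel(int timeoutEnabled) {
  return timeoutEnabled ? "enabled" : "disabled";
}

int main() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
    return 1;
  }

  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
      std::fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n",
                   dev, cudaGetErrorString(err));
      return 1;
    }
    // kernelExecTimeoutEnabled is nonzero when kernels on this device
    // are subject to a run-time limit.
    std::printf("Device %d (%s): kernel run-time limit %s\n",
                dev, prop.name, watchdogLabel(prop.kernelExecTimeoutEnabled));
  }
  return 0;
}
```

Once the option is in effect, this should report the limit as disabled for the GPU that drives the display. The deviceQuery sample shipped with the CUDA toolkit prints the same property as "Run time limit on kernels".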
