
Running the code below to write 1 GB to global memory under the NVIDIA Visual Profiler, I get:
- 100% global store efficiency
- 69.4% (128.6 GB/s) DRAM utilization
- 18.3% total replay overhead
- 18.3% global memory replay overhead

The memory writes should be fully coalesced and there is no divergence in the kernel, so where is the global memory replay overhead coming from? I am running this on Ubuntu 13.04 with nvidia-cuda-toolkit version 5.0.35-4ubuntu1.
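To put numbers on that expectation: the kernel writes 2^30 / 4 = 268,435,456 32-bit words, and each warp of 32 threads stores 32 × 4 = 128 contiguous, aligned bytes, i.e. exactly one 128-byte line. A fully coalesced run should therefore generate 268,435,456 / 32 = 8,388,608 store requests, each served by a single transaction.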

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void
HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

// Global memory writes: a grid-stride loop, so consecutive threads in a
// warp store to consecutive 4-byte words (fully coalesced accesses)
__global__ void
kernel_write(uint32_t *start, uint32_t entries)
{
    uint32_t tid = threadIdx.x + blockIdx.x*blockDim.x;

    while (tid < entries) {
        start[tid] = tid;             // one 32-bit store per thread
        tid += blockDim.x*gridDim.x;  // advance by total thread count
    }
}

int main(int argc, char *argv[])
{
    uint32_t *gpu_mem;               // Memory pointer
    uint32_t n_blocks  = 256;        // Blocks per grid
    uint32_t n_threads = 192;        // Threads per block
    uint32_t n_bytes   = 1073741824; // Transfer size (1 GB)
    float elapsedTime;               // Elapsed write time

    // Allocate 1 GB of memory on the device
    HANDLE_ERROR( cudaMalloc((void **)&gpu_mem, n_bytes) );

    // Create events
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );

    // Write to global memory
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    kernel_write<<<n_blocks, n_threads>>>(gpu_mem, n_bytes/4);
    HANDLE_ERROR( cudaGetLastError() );
    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&elapsedTime, start, stop) );

    // Report elapsed time and write bandwidth (bytes/ms * 1e-6 = GB/s)
    printf("#Delay(ms)  BW(GB/s)\n");
    printf("%10.6f  %10.6f\n", elapsedTime, 1e-6*n_bytes/elapsedTime);

    // Destroy events
    HANDLE_ERROR( cudaEventDestroy(start) );
    HANDLE_ERROR( cudaEventDestroy(stop) );

    // Free memory
    HANDLE_ERROR( cudaFree(gpu_mem) );

    return 0;
}
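
For reference, here is how I build and run it (assuming the source file is named app.cu; the binary name app matches the nvprof invocations below):

$ nvcc -o app app.cu
$ ./app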
What GPU is this on? If the GPU supports ECC, is it enabled? – njuffa
This is a GeForce GTX 580, and currently "Device has ECC support: Disabled". Any thoughts? Could this be the issue? – coder
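
(For anyone reproducing this: a minimal sketch for querying the ECC state programmatically, using the runtime API's cudaDeviceProp::ECCEnabled field:)

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // ECCEnabled is non-zero when ECC is turned on for the device
        printf("Device %d (%s): ECC %s\n", dev, prop.name,
               prop.ECCEnabled ? "enabled" : "disabled");
    }
    return 0;
}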

1 Answer


The nvprof profiler and the API profiler are giving different results:

$ nvprof --events gst_request ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms)  BW(GB/s)
 13.345920   80.454690
======== Profiling result:
          Invocations       Avg       Min       Max  Event Name
Device 0
    Kernel: kernel_write(unsigned int*, unsigned int)
                    1   8388608   8388608   8388608  gst_request

$ nvprof --events global_store_transaction ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms)  BW(GB/s)
  9.469216  113.392892
======== Profiling result:
          Invocations       Avg       Min       Max  Event Name
Device 0
    Kernel: kernel_write(unsigned int*, unsigned int)
                    1   8257560   8257560   8257560  global_store_transaction

I had the impression that global_store_transaction could not be lower than gst_request. What is going on here? I can't ask for both events in the same command, so I had to run two separate commands. Could this be the problem?
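
For what it's worth, the gst_request count matches the fully coalesced expectation exactly (268,435,456 stores / 32 threads per warp = 8,388,608 requests), while global_store_transaction comes out about 1.6% lower (8,257,560). Since each event was collected in a separate run, I suppose the two counts are not guaranteed to be mutually consistent.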

Strangely, the API profiler shows different results, with perfect coalescing. Here is the output; I had to run the program twice to collect all the counters:

$ cat config.txt
inst_issued
inst_executed
gst_request

$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log.csv COMPUTE_PROFILE_CONFIG=config.txt ./app

$ cat log.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eaca946b8
method,gputime,cputime,occupancy,inst_issued,inst_executed,gst_request,gld_request
_Z12kernel_writePjj,7771.776,7806.000,1.000,4737053,3900426,557058,0
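
These counters also let me sanity-check the Visual Profiler's replay figure: assuming total replay overhead is (inst_issued - inst_executed) / inst_issued, this run gives (4,737,053 - 3,900,426) / 4,737,053 ≈ 17.7%, consistent with the ~18.3% total replay overhead reported above.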

$ cat config2.txt
global_store_transaction

$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log2.csv COMPUTE_PROFILE_CONFIG=config2.txt ./app

$ cat log2.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eea92d0e8
method,gputime,cputime,occupancy,global_store_transaction
_Z12kernel_writePjj,7807.584,7831.000,1.000,557058

Here gst_request and global_store_transaction are exactly the same, showing perfect coalescing. So which one is correct (nvprof or the API profiler)? And why does the NVIDIA Visual Profiler say that I have non-coalesced writes? There are still significant instruction replays, and I have no idea where they are coming from :(

Any ideas? I don't think this is a hardware malfunction, since I have two boards in the same machine and both show the same behavior.