5 votes

I just started learning CUDA and I'm having trouble interpreting my experiment results. I wanted to compare CPU vs. GPU performance in a simple program that adds two vectors together. The code is as follows:

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <helper_timer.h>  // from cuda-samples/Common: StopWatchInterface and the sdk* timer functions

__global__ void add(int *a, int *b, int *c, long long n) {
    long long tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

void add_cpu(int* a, int* b, int* c, long long n) {
    for (long long i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

void check_results(int* gpu, int* cpu, long long n) {
    for (long long i = 0; i < n; i++) {
        if (gpu[i] != cpu[i]) {
            printf("Different results!\n");
            return;
        }
    }
}

int main(int argc, char* argv[]) {
    long long n = atoll(argv[1]);
    int num_of_blocks = atoi(argv[2]);
    int num_of_threads = atoi(argv[3]);

    int* a = new int[n];
    int* b = new int[n]; 
    int* c = new int[n]; 
    int* c_cpu = new int[n];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, n * sizeof(int));
    cudaMalloc((void **) &dev_b, n * sizeof(int));
    cudaMalloc((void **) &dev_c, n * sizeof(int));
    for (long long i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);

    StopWatchInterface *timer=NULL;
    sdkCreateTimer(&timer);
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);  

    add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);

    cudaDeviceSynchronize();
    sdkStopTimer(&timer);
    float time = sdkGetTimerValue(&timer);
    sdkDeleteTimer(&timer);

    cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    clock_t start = clock();
    add_cpu(a, b, c_cpu, n);
    clock_t end = clock();

    check_results(c, c_cpu, n);
    printf("%f %f\n", (double)(end - start) * 1000 / CLOCKS_PER_SEC, time);

    return 0;
}

I ran this code in a loop with a bash script:

for i in {1..2560}
do
    n="$((1024 * i))"
    out=`./vectors $n $i 1024`
    echo "$i $out" >> "./vectors.txt"
done

Here 2560 is the maximum number of blocks that my GPU supports, and 1024 is the maximum number of threads per block. So I ran it with the maximum block size, from one block up to the largest problem size my GPU can handle, in steps of one block (1024 ints in the vector).

Here is my GPU info:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2070 SUPER"
  CUDA Driver Version / Runtime Version          11.3 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (040) Multiprocessors, (064) CUDA Cores/MP:    2560 CUDA Cores
  GPU Max Clock rate:                            1785 MHz (1.78 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

After running the experiment I gathered the results and plotted them: Relation between execution time and vector size

What bothers me is the 256-block-wide period in the GPU execution time. I have no clue why this happens. Why is executing 512 blocks much slower than executing 513 blocks?

I also checked this with a constant number of blocks (2560) as well as with different block sizes, and it always gives a period of 256 * 1024 in vector size (so for block size 512 it is every 512 blocks, not every 256 blocks). So maybe this is something related to memory, but I can't figure out what.
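
For reference, one period of 256 * 1024 ints corresponds to 256 * 1024 * 4 bytes = 1 MiB per vector, i.e. 3 MiB across the three vectors the kernel touches.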

I would appreciate any ideas on why this is happening.

Did you enable optimization? The numbers indicate a low memory throughput for the CPU implementation, close to 6 GB/s (for Intel processors). Alternatively, it may be due to page faults, since c_cpu is not read or written beforehand. Also, why do you say "2560 is the maximum number of blocks that my GPU supports"? This GPU should be able to support many more blocks per kernel. According to this page, the limit should be very high in this case. – Jérôme Richard
@JérômeRichard I use nvcc -I cuda-samples/Common lab1/vectors.cu -o vectors to compile it. I have no idea whether optimization is enabled or not; I am completely new to CUDA. Regarding the 2560 blocks: yep, I think I misunderstood how it works. If I understand correctly now, I have 128 * 2560 concurrent blocks that I can use? – zuroslav
I think nvcc optimizes the CUDA kernel by default, but AFAIK not the CPU code. You should add the -O3 flag for the CPU code to be fast. Regarding the blocks: no, 2560 CUDA cores can work simultaneously, split across 40 multiprocessors (i.e. SMs). So 2560 threads can actively run in parallel. You can find more information here. You can have a lot of blocks, but not all of them may be computed simultaneously (and this is fine). You can find more information about how blocks are scheduled here. – Jérôme Richard

1 Answer

7 votes

This is by no means a complete or precise answer. However, I believe the periodic pattern you are observing is at least partly due to a one-time or first-time kernel launch overhead. Good benchmarking practice is usually to do something other than what you are doing: for example, run the kernel multiple times and take an average, or perform some other kind of statistical measurement.
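
As an illustration only (a minimal sketch, not the code used for the measurements below), averaging over several launches with CUDA events might look like this; the iteration count is an arbitrary choice:

// Sketch: average the kernel time over many launches using CUDA events.
// The iteration count (100) is arbitrary and is not part of the original program.
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);

const int iterations = 100;
cudaEventRecord(ev_start);
for (int i = 0; i < iterations; i++) {
    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
}
cudaEventRecord(ev_stop);
cudaEventSynchronize(ev_stop);

float total_ms = 0.0f;
cudaEventElapsedTime(&total_ms, ev_start, ev_stop);
printf("average kernel time: %f ms\n", total_ms / iterations);

cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);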

When I run your code using your script on a GTX 960 GPU, I get the following graph (only the GPU data is plotted; the vertical axis is in milliseconds):

without warm-up

When I modify your code as follows:

cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);
// next two lines added:
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();

StopWatchInterface *timer=NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);  

add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);

cudaDeviceSynchronize();

Doing a "warm-up" run first, then timing the second run, I witness data like this:

with warm-up

So the data without the warm-up shows a periodicity; after the warm-up, the periodicity disappears. I conclude that the periodicity is due to some kind of one-time or first-time behavior. Typical things in this category are caching effects and CUDA "lazy" initialization effects (for example, the time taken to JIT-compile the GPU code, which is certainly happening in your case, or the time to load the GPU code into GPU memory). I won't be able to go further and explain exactly which first-time effect is giving rise to the periodicity.
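
As an aside, and only as a hedged sketch (none of this is in the original code): the lazy-initialization component can be separated out by forcing CUDA initialization before starting the timer, and the JIT-compilation component can be avoided by compiling directly for the device architecture (e.g. nvcc -arch=sm_75 for the RTX 2070 SUPER):

// Sketch: force context creation / lazy initialization before any timing.
// cudaFree(0) is a commonly used no-op trigger for initialization; it is an
// illustrative addition, not part of the original program.
cudaFree(0);
cudaDeviceSynchronize();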

Another observation is that while my data shows an expected "average slope" in each graph, indicating that the kernel duration associated with 2560 blocks is approximately 5 times the kernel duration associated with 512 blocks, I don't see that kind of trend in your data. It ought to be there, however. Your GPU will "saturate" at about 40 blocks. Thereafter, the average kernel duration should increase in an approximately linear fashion, such that the kernel duration associated with 2560 blocks is 4-5x the kernel duration associated with 512 blocks. I can't explain your data in this respect at all; I suspect a graphing or data-processing error, or else a characteristic of your environment (e.g. a GPU shared with other users, a broken CUDA install, etc.) that is not present in my environment and which I'm unable to guess at.
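
To spell out the arithmetic behind that expectation, using the deviceQuery numbers above: 40 multiprocessors * 1024 maximum resident threads per multiprocessor = 40,960 threads in flight, and with 1024-thread blocks that is one resident block per multiprocessor, i.e. about 40 blocks to saturate the GPU. Beyond that, blocks execute in successive waves, so going from 512 blocks to 2560 blocks is a factor of 2560 / 512 = 5, hence the roughly 5x ratio in kernel duration.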

Finally, my conclusion is that GPU "expected" behavior is more evident in the presence of good benchmarking techniques.