0 votes

As it is known that copying data to the GPU is slow I was wondering what specifically "counts" as passing data to the GPU.

__global__
void add_kernel(float* a, float* b, float* c, int size) {
    for (int i = 0; i < size; ++i) {
        a[i] = b[i] + c[i];
    }
}

int main() {
    int size = 100000; // Or any arbitrarily large number
    int reps = 1000;   // Or any arbitrarily large number

    float* a; // float* of [size], already allocated on the GPU with cudaMalloc
    float* b; // float* of [size], already allocated on the GPU with cudaMalloc
    float* c; // float* of [size], already allocated on the GPU with cudaMalloc

    for (int i = 0; i < reps; ++i)
        add_kernel<<<blocks, threads>>>(a, b, c, size); // blocks/threads chosen elsewhere
}

Does something such as passing size to the kernel incur (significant) overhead? Or do "data transfers" refer more specifically to copying large arrays from host memory to the GPU?

I.e., would this variant be (much) faster:

__global__
void add_kernel(float* a, float* b, float* c, int size, int reps) {
    for (int j = 0; j < reps; ++j) {
        for (int i = 0; i < size; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}

int main() {
    int size = 100000; // Or any arbitrarily large number
    int reps = 1000;   // Or any arbitrarily large number

    float* a; // float* of [size], already allocated on the GPU with cudaMalloc
    float* b; // float* of [size], already allocated on the GPU with cudaMalloc
    float* c; // float* of [size], already allocated on the GPU with cudaMalloc

    add_kernel<<<blocks, threads>>>(a, b, c, size, reps);
}

I.e. (again), in "ideal" CUDA programs, should programmers attempt to write the large majority of the computation purely inside CUDA kernels, or write kernels that are repeatedly called from the CPU (in the case that passing values from the stack does not incur significant overhead)?

Edited for clarity.

3
In both cases you are "transferring" 28 bytes. There is no significant overhead here. - talonmies
Well, in one I have the outer ("repetition") loop on the CPU, and in the other it is inside the kernel. Is this significant, or can I expect the compiler to optimize this away? - Joseph Franciscus
Are a, b and c located in device memory? It looks like they're in host memory. Then every function call would require a copy from host memory to device memory (unless the compiler is really smart). In that case the second example will be faster, yes: 1 function call against 1000 function calls. In the case that the variables are already in device memory, the overhead of a function call with size is not significant. These kinds of micro-optimizations are not where you get the big gains. - JHBonarius
I used //cudaMalloc to infer they're on the device. Sorry if that's unclear. *Edited to clarify they are already on the GPU. - Joseph Franciscus

3 Answers

2 votes

Everything counts. In order to run a kernel, the CPU needs to somehow communicate which kernel to call and with which parameters. At the "micro level", if your kernel performs only a few operations, these are considerable expenses. In real life, if your kernels do a lot of work, they are negligible.

Relatively large fixed costs can also show up if such small operations are not pipelined. You can see this in NVIDIA's Visual Profiler. I don't know/remember the exact numbers, but the orders of magnitude are as follows. Bandwidth between CPU and GPU can be around 1 GB/s, i.e. 1 byte per nanosecond. But actually sending a 4-byte packet and getting the acknowledgement takes something like 1 microsecond. So sending 10000 bytes takes about 11 microseconds: roughly 1 microsecond of fixed latency plus 10 microseconds of transfer time.
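
To make the latency-versus-bandwidth split concrete, here is a minimal sketch (my illustration, not part of the original answer) that times one tiny and one large host-to-device copy with CUDA events; exact numbers will vary by hardware and bus:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Times a single host-to-device copy of `bytes` bytes using CUDA events.
// Tiny copies are dominated by the fixed per-transfer latency;
// large copies are dominated by bus bandwidth.
static float time_h2d_ms(void* dst, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t small = 4;                 // one int: latency-bound
    const size_t large = 100 * 1000 * 1000; // ~100 MB: bandwidth-bound
    char* h = (char*)malloc(large);
    char* d;
    cudaMalloc((void**)&d, large);
    printf("4 B copy:    %f ms\n", time_h2d_ms(d, h, small));
    printf("100 MB copy: %f ms\n", time_h2d_ms(d, h, large));
    cudaFree(d);
    free(h);
}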

Also, execution is optimized for massive parallelism on the GPU, so executing 10 consecutive operations with a single 32-thread warp can take something like 200 GPU clock cycles (about 0.2 microseconds). And say 0.5 microseconds for sending the kernel launch command before execution starts.

In real life the problem is usually that, to sum 100 million numbers, you'll spend about 0.4 seconds because of the bandwidth limitation and only about 0.1 milliseconds on the calculation itself, because a top GPU can perform about 1000 operations per clock cycle, with each cycle close to 1 nanosecond long (100 million operations / 1000 per cycle = 100,000 cycles, i.e. about 0.1 ms).
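
A related sketch (again my addition, not part of the original answer) that estimates the per-launch cost directly, by timing many launches of an empty kernel and averaging; since launches are queued asynchronously, this measures launch/queueing throughput:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {} // no work: measured time is pure launch overhead

int main() {
    const int launches = 10000;

    // Warm-up launch: the first launch pays one-time initialization costs.
    empty_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i) {
        empty_kernel<<<1, 32>>>(); // asynchronous: queues the launch command
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average cost per launch: %f microseconds\n",
           ms * 1000.0f / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}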

0 votes

I have benchmarked the two versions. Simply calling CUDA kernels DOES have a noticeable overhead.

This is the output:

 Calculating... (BlackCat_Tensors) reps outside
It took me 27.359249 clicks (27.359249 seconds).

 Calculating... (BlackCat_Tensors) reps inside 
It took me 10.855168 clicks (10.855168 seconds).

This is my benchmark:

/*
 * test_area.cu
 *
 *  Created on: Jan 11, 2018
 *      Author: joseph
 */

#ifndef TEST_AREA_CU_
#define TEST_AREA_CU_

#include <omp.h>
#include <stdio.h>


int threads() {
    return 256;
}
int blocks(int size) {
    return (size + threads() - 1) / threads();
}

// Note: every launched thread runs the full serial loop, so the threads do
// redundant (and racy) work. But both versions do the same work per rep,
// so the comparison still isolates the per-launch overhead.
__global__
void add_kernel(float* a, float* b, float* c, int size) {
    for (int i = 0; i < size; ++i) {
        a[i] = b[i] + c[i];
    }
}

__global__
void add_kernel(float* a, float* b, float* c, int size, int reps) {
    for (int j = 0; j < reps; ++j) {
        for (int i = 0; i < size; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}

int main() {
    int sz = 10000;   // Or any arbitrarily large number
    int reps = 10000; // Or any arbitrarily large number

    float* a; // float* of [sz] allocated on the GPU
    float* b; // float* of [sz] allocated on the GPU
    float* c; // float* of [sz] allocated on the GPU

    cudaMalloc((void**)&a, sizeof(float) * sz);
    cudaMalloc((void**)&b, sizeof(float) * sz);
    cudaMalloc((void**)&c, sizeof(float) * sz);


    double t = omp_get_wtime(); // omp_get_wtime() returns a double
    printf("\n Calculating... (BlackCat_Tensors) reps outside\n");

    for (int i = 0; i < reps; ++i) {
        add_kernel<<<blocks(sz), threads()>>>(a, b, c, sz);
        cudaDeviceSynchronize();
    }
    t = omp_get_wtime() - t;
    printf("It took me %f clicks (%f seconds).\n", t, ((float) t));

    t = omp_get_wtime();
    printf("\n Calculating... (BlackCat_Tensors) reps inside \n");

    add_kernel<<<blocks(sz), threads()>>>(a, b, c, sz, reps);
    cudaDeviceSynchronize();

    t = omp_get_wtime() - t;
    printf("It took me %f clicks (%f seconds).\n", t, ((float) t));

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
}


#endif /* TEST_AREA_CU_ */

0 votes

Here's a secondary benchmark. I imagine the thread count for the inside-loop version could be raised, as it is calculating more per launch, and so it should show an even greater disparity in performance.

The code is identical to the benchmark in the previous answer, except that reps is reduced from 10000 to 1000:

int sz = 10000;  // Or any arbitrarily large number
int reps = 1000; // Or any arbitrarily large number

This is the output:
 Calculating... (BlackCat_Tensors) reps outside
It took me 14.969501 clicks (14.969501 seconds).

 Calculating... (BlackCat_Tensors) reps inside 
It took me 13.060688 clicks (13.060688 seconds).