Since copying data to the GPU is known to be slow, I was wondering what specifically "counts" as passing data to the GPU.
__global__
void add_kernel(float* a, float* b, float* c, int size) {
    for (int i = 0; i < size; ++i) {
        a[i] = b[i] + c[i];
    }
}

int main() {
    int size = 100000; // or any arbitrarily large number
    int reps = 1000;   // or any arbitrarily large number
    extern float* a;   // float* of [size] allocated on the GPU
    extern float* b;   // float* of [size] allocated on the GPU
    extern float* c;   // float* of [size] allocated on the GPU
    for (int i = 0; i < reps; ++i)
        add_kernel<<<blocks, threads>>>(a, b, c, size);
}
Does something such as passing size to the kernel incur (significant) overhead? Or does "data transfer" refer more specifically to copying large arrays from host memory to the GPU?
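For context, here is a minimal sketch of where the expensive transfer actually happens (variable names like d_a/h_b are illustrative, not from the snippets above). Scalar kernel arguments such as size travel with the launch itself and are only a few bytes; the slow part is cudaMemcpy of the arrays:

```cuda
// Illustrative setup, assuming host data must first reach the device.
float* h_b;        // host buffer of [size] (filled elsewhere)
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, size * sizeof(float));
cudaMalloc(&d_b, size * sizeof(float));
cudaMalloc(&d_c, size * sizeof(float));

// THIS is the slow "data transfer": a bulk copy over PCIe.
cudaMemcpy(d_b, h_b, size * sizeof(float), cudaMemcpyHostToDevice);

// The kernel arguments here (three pointers and an int) are copied
// as part of the launch and cost almost nothing by comparison.
add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, size);
```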
I.e., would this variant be (much) faster?
__global__
void add_kernel(float* a, float* b, float* c, int size, int reps) {
    for (int j = 0; j < reps; ++j) {
        for (int i = 0; i < size; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}

int main() {
    int size = 100000; // or any arbitrarily large number
    int reps = 1000;   // or any arbitrarily large number
    extern float* a;   // float* of [size] allocated on the GPU
    extern float* b;   // float* of [size] allocated on the GPU
    extern float* c;   // float* of [size] allocated on the GPU
    add_kernel<<<blocks, threads>>>(a, b, c, size, reps);
}
I.e. (again), in "ideal" CUDA programs, should programmers attempt to write the large majority of the computation purely in CUDA kernels, or write CUDA kernels that are repeatedly called from the CPU (assuming that passing values from the stack does not incur significant overhead)?
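As a side note, both kernels as written execute the full loop in every launched thread. An idiomatic CUDA kernel instead splits the iterations across threads with a grid-stride loop; a sketch (separate from the launch-overhead question itself):

```cuda
__global__
void add_kernel(float* a, const float* b, const float* c, int size) {
    // Each thread processes a strided subset of the indices, so the
    // whole grid covers [0, size) exactly once regardless of grid size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < size;
         i += blockDim.x * gridDim.x) {
        a[i] = b[i] + c[i];
    }
}
```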
Edited for clarity.
Comments:
- ...CPU and the other inside the Kernel. Is this significant or can I expect the compiler to optimize this away? - Joseph Franciscus
- Are a, b and c located in the device memory? It looks like they're in the host memory. Then every function call would require a copy from the host memory to the device memory (unless the compiler is really smart). In that case the second example will be faster, yes: 1 function call against 1000 function calls. In the case the variables are already in device memory, the overhead of a function call with size is not significant. These kinds of micro-optimizations are not where you get the big gains. - JHBonarius
- ...// cudaMalloc to infer they're on the device. Sorry if that's unclear. *Edited to clarify they are already on the gpu. - Joseph Franciscus