I have noticed that the CPU executes faster than the GPU for small input sizes. Why is this? Kernel launch preparation, data transfer, or something else?
For example, for the following kernel and CPU function (CUDA code):
__global__ void squareGPU(float* d_in, float* d_out, unsigned int N) {
    unsigned int lid = threadIdx.x;
    unsigned int gid = blockIdx.x*blockDim.x + lid;
    if(gid < N) {
        d_out[gid] = d_in[gid]*d_in[gid];
    }
}
void squareCPU(float* d_in, float* d_out, unsigned int N) {
    for(unsigned int i = 0; i < N; i++) {
        d_out[i] = d_in[i]*d_in[i];
    }
}
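For reference, a minimal host-side harness around these two functions might look like the sketch below (the h_in/h_out buffers, the initialization, and the error-free happy path are illustrative assumptions, not my exact test program):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative harness: allocate, copy the input to the device once,
// launch the kernel, and copy the result back.
int main() {
    const unsigned int N = 5000;
    const unsigned int block_size = 256;
    const unsigned int num_blocks = (N + block_size - 1) / block_size;

    float *h_in  = (float*)malloc(N * sizeof(float));
    float *h_out = (float*)malloc(N * sizeof(float));
    for(unsigned int i = 0; i < N; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    squareGPU<<<num_blocks, block_size>>>(d_in, d_out, N);
    cudaDeviceSynchronize();  // wait for the kernel to finish

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[2] = %f\n", h_out[2]);  // expect 4.0

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}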
Running these functions 100 times on an array of 5000 32-bit floats, I get the following output from a small test program:
Size of array:
5000
Block size:
256
You chose N=5000 and block size: 256
Total time for GPU: 403 microseconds (0.40ms)
Total time for CPU: 137 microseconds (0.14ms)
Increasing the size of the array to 1000000, I get:
Size of array:
1000000
Block size:
256
You chose N=1000000 and block size: 256
Total time for GPU: 1777 microseconds (1.78ms)
Total time for CPU: 48339 microseconds (48.34ms)
I am not including the time used to transfer data between host and device (and vice versa); here is the relevant part of my timing procedure:
gettimeofday(&t_start, NULL);
for(int i = 0; i < 100; i++) {
    squareGPU<<<num_blocks, block_size>>>(d_in, d_out, N);
}
cudaDeviceSynchronize();
gettimeofday(&t_end, NULL);
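An alternative timing approach I am aware of uses CUDA events, which measure on the GPU timeline rather than with gettimeofday; a minimal sketch of that idea (not my actual test code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for(int i = 0; i < 100; i++) {
    squareGPU<<<num_blocks, block_size>>>(d_in, d_out, N);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // block until the stop event has completed

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);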
After choosing a block size, I compute the number of blocks relative to the array size: unsigned int num_blocks = (array_size + (block_size - 1)) / block_size;
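As a sanity check of that ceiling-division formula with the numbers above (my own arithmetic, for illustration):

// With N = 5000 and block_size = 256:
//   num_blocks = (5000 + 255) / 256 = 5255 / 256 = 20 blocks
// That launches 20 * 256 = 5120 threads; the last 120 of them fail the
// gid < N guard in squareGPU and simply do no work.
unsigned int num_blocks = (5000u + 255u) / 256u;  // == 20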