1 vote

I have noticed that the CPU executes faster than the GPU for small input sizes. Why is this? Is it preparation, data transfer, or something else?

For example, consider the following kernel and CPU function (CUDA code):

__global__ void squareGPU(float* d_in, float* d_out, unsigned int N) {
    unsigned int lid = threadIdx.x;               // local thread index within the block
    unsigned int gid = blockIdx.x*blockDim.x+lid; // global element index
    if(gid < N) {                                 // guard: threads in the last block may fall outside the array
        d_out[gid] = d_in[gid]*d_in[gid]; 
    }
}

void squareCPU(float* d_in, float* d_out, unsigned int N) {
    for(unsigned int i = 0; i < N; i++) {
        d_out[i] = d_in[i]*d_in[i]; 
    }
}

Running these functions 100 times on an array of 5000 32-bit floats, I get the following output from a small test program:

Size of array:
5000
Block size:
256

You chose N=5000 and block size: 256

Total time for GPU: 403 microseconds (0.40ms)
Total time for CPU: 137 microseconds (0.14ms)

Increasing the size of the array to 1000000, I get:

Size of array:
1000000
Block size:
256

You chose N=1000000 and block size: 256

Total time for GPU: 1777 microseconds (1.78ms)
Total time for CPU: 48339 microseconds (48.34ms)

I am not including the time used to transfer data between host and device (and vice versa); in fact, here is the relevant part of my testing procedure:

gettimeofday(&t_start, NULL);

for(int i = 0; i < 100; i++) {
    squareGPU<<<num_blocks, block_size>>>(d_in, d_out, N);
}
cudaDeviceSynchronize();

gettimeofday(&t_end, NULL);

After choosing a block size, I compute the number of blocks relative to the array size: unsigned int num_blocks = ((array_size + (block_size-1)) / block_size);
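For the two test sizes above, that works out to (with block size 256):

// ceiling division: round up so the last, partially filled block is still launched
unsigned int num_blocks_small = (5000    + 255) / 256;  // 20 blocks   -> 20*256   = 5120 threads launched
unsigned int num_blocks_large = (1000000 + 255) / 256;  // 3907 blocks -> 3907*256 = 1000192 threads launched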

Could it possibly be because the CPU is faster than the GPU in general? - Lundin
Offloading to the GPU generally only makes sense when you have computationally expensive tasks that are easily parallelizable. Otherwise the CPU will almost always be faster. - Peter
@Lundin If the CPU were faster than the GPU in general, the GPU would not become faster beyond some input size n (i.e. asymptotically faster). I am wondering why the CPU is faster for small inputs. Did you see the output of my tests? :) - Tihi
@Peter The GPU already starts to get faster for the simple task of squaring (as shown in my examples) when the input size is 1000000. I would not consider squaring 1000000 elements an expensive task (I think). But what would be a likely reason that the CPU executes faster for smaller inputs (like squaring an array of 5000 elements)? - Tihi
@Tihi Think of the GPU as being made up of many (many) cores that are individually less powerful than the few cores of your CPU. Only when there is a lot of work to do, and you can efficiently distribute it across those GPU cores, will the GPU be faster. Expensive is of course relative; generally, with any problem of this nature you will observe an input-size cutoff point beyond which the GPU outperforms the CPU. - Peter

1 Answer

7 votes

Answering the general question of CPU vs. GPU performance comparison is fairly complicated, and generally involves consideration of at least 3 or 4 different factors that I can think of. However, you've simplified the problem somewhat by isolating your measurement to the actual calculations, as opposed to the data transfer or the "complete operation".

In this case, there are probably at least 2 things to consider:

  1. Kernel launch overhead - Launching a kernel on a GPU carries an "approximately" fixed cost overhead, usually in the range of 5 to 50 microseconds per kernel launch. This means that if you size the amount of work such that your CPU can do it in less than that amount of time, there is no way the GPU can be faster. Even above that level, there is a linear function which describes that overhead model, which I'm sure you can work out if you wish, to compare CPU vs. GPU performance in the presence of a fixed cost overhead (a minimal measurement sketch follows after this list). When comparing small test cases, this is an important factor to consider; however, my guess is that because most of your test case timings are well above 50 microseconds, we can safely "ignore" this factor, as an approximation.

  2. The actual performance/capability of your specific CPU vs. your specific GPU. This is generally hard to model, depends on the specific hardware you are using, and you haven't provided that info. However, we can make some observations anyway, and some conjecture, based on the data you have provided, expanding on this in the next section.
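As a rough way to check item 1 on your own hardware, here is a minimal sketch (not your code) that times back-to-back launches of an empty kernel, using the same gettimeofday approach as your test program; the kernel name and iteration count are just illustrative:

#include <cstdio>
#include <sys/time.h>

// an empty kernel: whatever time we measure per launch is essentially launch overhead
__global__ void emptyKernel() { }

int main() {
    struct timeval t_start, t_end;

    emptyKernel<<<1, 1>>>();          // warm-up launch (absorbs one-time context setup)
    cudaDeviceSynchronize();

    gettimeofday(&t_start, NULL);
    for(int i = 0; i < 1000; i++) {
        emptyKernel<<<1, 1>>>();
    }
    cudaDeviceSynchronize();
    gettimeofday(&t_end, NULL);

    long total_us = (t_end.tv_sec - t_start.tv_sec) * 1000000L
                  + (t_end.tv_usec - t_start.tv_usec);
    printf("Average launch overhead: ~%ld microseconds\n", total_us / 1000);
    return 0;
}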

Your two cases involve a total amount of work described by N, considering N=5000 and N=1000000. Building a little chart:

        N  |  CPU time (us)  |  GPU time (us)
     5000  |            137  |            403
  1000000  |          48339  |           1777

So we see that in the case of the CPU, when the work increased by a factor of 200, the execution time increased by a factor of ~353, whereas in the GPU case, the execution time increased by a factor of only ~4.4. We'll need to explain both of these "non-linearities" in order to have a reasonable guess as to what is going on.
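Spelling out the arithmetic behind those factors:

   work:      1000000 / 5000  =  200x
   CPU time:    48339 /  137  ≈  353x   (grows much faster than the work)
   GPU time:     1777 /  403  ≈  4.4x   (grows much more slowly than the work)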

  1. Effects of cache - because you are running your test cases 100 times, the caches could have an effect. In the CPU case, this is my only guess as to why you are not seeing a linear relationship. I would guess that at the very small size, you are in some CPU "inner" cache, with 40KB of data "in view" (2 arrays x 5000 floats x 4 bytes). Going to the larger size, you have 8MB of data in view, and although this probably fits in the "outer" cache on your CPU, it's possible it may not, and even if it does, the outer cache may yield slower overall performance than the inner cache. I would guess this is the reason for the CPU appearing to get worse as the data gets larger. Your CPU is being affected non-linearly in a negative way by the larger data set. In the GPU case, the outer cache is at most 6MB (unless you are running on an Ampere GPU), so your larger data set does not fit completely in the outer cache.

  2. Effects of machine saturation - both the CPU and GPU can be fully "loaded" or partially loaded, depending on the workload. In the CPU case, I am guessing you are not using any multi-threading, therefore your CPU code is restricted to a single core. (And your CPU almost certainly has multiple cores available.) Your single-threaded code will approximately "saturate", i.e. keep busy, that single core. However the GPU has many cores, and I would guess that your smaller test case (which works out to 5000 threads) will only partially saturate your GPU. What I mean is that some of the GPU thread processing resources will be idle in the smaller case (unless you happen to be running on the smallest of GPUs). 5000 threads is only about enough to keep 2 GPU SMs busy, so if your GPU has more than 2 SMs, some of its resources are idle during the smaller test case, whereas your million-thread larger test case is enough to saturate, i.e. keep busy, all the thread processing resources on any current CUDA GPU. The effect of this is that while the CPU doesn't benefit at all from the larger test case (you should consider using multi-threading; a minimal sketch follows after this list), your GPU is likely benefiting. The larger test case allows your GPU to do more work in the same amount of time that the smaller test case is taking. Therefore the GPU benefits non-linearly in a positive way from the larger workload.
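Regarding the multi-threading suggestion in item 2, one possible way to spread the CPU loop across cores is OpenMP. This is only a rough sketch, not a statement about your build setup; squareCPU_mt is a made-up name, and you would compile with an OpenMP flag such as -fopenmp:

#include <omp.h>

// same computation as squareCPU, but loop iterations are divided among the available CPU cores
void squareCPU_mt(float* in, float* out, unsigned int N) {
    #pragma omp parallel for
    for(int i = 0; i < (int)N; i++) {
        out[i] = in[i]*in[i];
    }
}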

The GPU is also better able to mitigate the effects of misses in the outer cache, when it is given a large enough workload. This is called the latency-hiding effect of the GPU in the presence of a "large" parallel workload, and the CPU doesn't have (or doesn't have as much of) a corresponding mechanism. So depending on your exact CPU and GPU, this could be an additional factor. I don't intend to give a full tutorial on latency-hiding here, but the concept is based partially on item 2 above, so you may gather the general idea/benefit from that.