I just started learning CUDA and I am having trouble interpreting my experiment results. I wanted to compare the CPU vs. the GPU in a simple program that adds two vectors together. The code is as follows:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>
#include <helper_timer.h>  // from cuda-samples/Common (StopWatchInterface, sdk* timer helpers)

__global__ void add(int *a, int *b, int *c, long long n) {
    long long tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}
void add_cpu(int* a, int* b, int* c, long long n) {
    for (long long i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
void check_results(int* gpu, int* cpu, long long n) {
    for (long long i = 0; i < n; i++) {
        if (gpu[i] != cpu[i]) {
            printf("Different results!\n");
            return;
        }
    }
}
int main(int argc, char* argv[]) {
    long long n = atoll(argv[1]);          // number of elements
    int num_of_blocks = atoi(argv[2]);
    int num_of_threads = atoi(argv[3]);

    int* a = new int[n];
    int* b = new int[n];
    int* c = new int[n];
    int* c_cpu = new int[n];

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **) &dev_a, n * sizeof(int));
    cudaMalloc((void **) &dev_b, n * sizeof(int));
    cudaMalloc((void **) &dev_c, n * sizeof(int));

    for (long long i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);  // c is uninitialized here; this copy is not actually needed

    // Time only the kernel (the synchronization waits for it to finish).
    StopWatchInterface *timer = NULL;
    sdkCreateTimer(&timer);
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
    cudaDeviceSynchronize();
    sdkStopTimer(&timer);
    float time = sdkGetTimerValue(&timer);
    sdkDeleteTimer(&timer);

    cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    // Time the same addition on the CPU.
    clock_t start = clock();
    add_cpu(a, b, c_cpu, n);
    clock_t end = clock();

    check_results(c, c_cpu, n);
    printf("%f %f\n", (double)(end - start) * 1000 / CLOCKS_PER_SEC, time);  // CPU ms, GPU ms
    return 0;
}
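(As a side note, the kernel section could also be timed with CUDA events instead of the SDK timer; this is only a sketch reusing the variables from the listing above, not what I actually measured with:)

    // Sketch: timing the same launch with CUDA events (reuses dev_a, dev_b, dev_c,
    // n, num_of_blocks and num_of_threads from the listing above).
    cudaEvent_t ev_start, ev_stop;
    cudaEventCreate(&ev_start);
    cudaEventCreate(&ev_stop);

    cudaEventRecord(ev_start);
    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
    cudaEventRecord(ev_stop);
    cudaEventSynchronize(ev_stop);                 // wait for the kernel and the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, ev_start, ev_stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(ev_start);
    cudaEventDestroy(ev_stop);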
I ran this code in a loop with a bash script:
for i in {1..2560}
do
    n="$((1024 * i))"
    out=`./vectors $n $i 1024`
    echo "$i $out" >> "./vectors.txt"
done
Here 2560 is the maximum number of blocks that my GPU supports, and 1024 is the maximum number of threads per block. So I ran it at the maximum block size, from the smallest problem size up to the maximum my GPU can handle, in steps of one block (1,024 ints added to the vectors per step).
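(In other words, every run satisfies n = blocks * 1024 exactly. For an arbitrary n the grid size would normally be computed with a rounded-up division; a generic sketch, not something this particular script needs:)

    int threads_per_block = 1024;
    int blocks = (int)((n + threads_per_block - 1) / threads_per_block);  // ceil(n / 1024)
    add<<<blocks, threads_per_block>>>(dev_a, dev_b, dev_c, n);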
Here is my GPU info:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version / Runtime Version 11.3 / 11.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 8192 MBytes (8589934592 bytes)
(040) Multiprocessors, (064) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS
After running the experiment I gathered the results and plotted the execution times against the number of blocks.
So what bothers me is this 256-block-wide period in the GPU execution time. I have no clue why this happens. Why is executing 512 blocks much slower than executing 513 blocks of threads?
I also checked this with a constant number of blocks (2560) as well as with different block sizes, and it always gives this period of 256 * 1024 in vector size (so for a block size of 512 it is every 512 blocks, not every 256 blocks). So maybe it is something to do with memory, but I can't figure out what.
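For reference, the observed period expressed in bytes (simple arithmetic on the numbers above):

    long long period_elems = 256LL * 1024;                // 262,144 ints per period
    long long per_array    = period_elems * sizeof(int);  // 1,048,576 bytes = 1 MiB per vector
    long long all_arrays   = 3 * per_array;               // 3 MiB across a, b and c per period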
I would appreciate any ideas on why this is happening.
Comments:

[...] c_cpu is not read/written before. Furthermore, why do you say "2560 is the maximum number of blocks that my GPU supports"? This GPU should be able to support many more blocks per kernel. According to this page, the limit should be very high in this case. – Jérôme Richard

I use nvcc -I cuda-samples/Common lab1/vectors.cu -o vectors to compile it, so I have no idea if I use optimization or not; I am completely new to CUDA. Regarding the 2560 blocks - yep, I think I misunderstood how it works, so if I understand correctly I have 128 * 2560 concurrent blocks that I can use? – zuroslav
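(For what it's worth: as far as I understand, nvcc does not pass an optimization flag to the host compiler by default, so host-code optimization has to be requested explicitly, e.g. with the same command as above plus -O3:)

    nvcc -O3 -I cuda-samples/Common lab1/vectors.cu -o vectors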
You need -O3 for the CPU code to be fast. For the blocks: no, the 2560 CUDA cores can work simultaneously, split across 40 multiprocessors (i.e. SMs), so 2560 threads can actively run in parallel. You can find more information here. You can have a lot of blocks, but not all of them may be computed simultaneously (and this is fine). You can find more information about how blocks are scheduled here. – Jérôme Richard
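(A small self-contained sketch of how those per-device limits can be queried from code with cudaGetDeviceProperties, to cross-check the deviceQuery output above; the printed fields are standard CUDA runtime properties:)

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0

        printf("Multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
        printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max blocks per SM:      %d\n", prop.maxBlocksPerMultiProcessor);  // field available since CUDA 11.0
        printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
        printf("Max grid size (x):      %d\n", prop.maxGridSize[0]);
        printf("L2 cache size:          %d bytes\n", prop.l2CacheSize);
        return 0;
    }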