While profiling an algorithm running on the GPU, I believe I'm hitting the memory bandwidth limit.
I have several complex kernels performing complicated operations (sparse matrix multiplication, reduction, etc.) and some very simple ones. When I calculate the total data read/written for each of them, all the significant kernels hit a ~79 GB/s bandwidth wall regardless of their complexity, while the theoretical bandwidth of the GPU (an NVIDIA GTX 960) is 112 GB/s.
The data set is very large (the vectors have ~10,000,000 float entries), so I get good measurements/statistics from clGetEventProfilingInfo between COMMAND_START and COMMAND_END. All the data remains in GPU memory for the duration of the algorithm, so there is virtually no host/device memory transfer (and it is not measured by the profiling counters anyway).
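For reference, this is roughly how I obtain the timing (a minimal sketch; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that evt is the event returned by clEnqueueNDRangeKernel):

#include <CL/cl.h>

/* Illustrative helper: elapsed kernel time in seconds from a profiled event */
static double kernel_seconds(cl_event evt)
{
    cl_ulong t_start = 0, t_end = 0;
    clWaitForEvents(1, &evt);  /* make sure the kernel has finished */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    return (t_end - t_start) * 1e-9;  /* profiling timestamps are in nanoseconds */
}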
Even for a very simple kernel (see below) that computes x = x + alpha*b, where x and b are huge vectors of ~10,000,000 entries, I don't get close to the theoretical bandwidth (112 GB/s); instead it runs at ~70% of the maximum (~79 GB/s).
__kernel void add_vectors(int N, __global float *x, __global float const *b, float factor)
{
    int gid = get_global_id(0);
    if (gid < N)
        x[gid] += b[gid] * factor;
}
I calculate the data transfer for this particular kernel per run as N * (2 + 1) * 4 bytes (the resulting bandwidth calculation is sketched right after this list):
- N: the vector size, ~10,000,000
- 2 loads and 1 store per vector entry
- 4 bytes for sizeof(float)
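With these numbers the effective bandwidth works out as follows (a sketch; elapsed_seconds is the COMMAND_START/COMMAND_END interval from above, converted to seconds):

size_t bytes = (size_t)N * (2 + 1) * sizeof(float);   /* ~120 MB per kernel launch */
double gbps  = bytes / elapsed_seconds / 1e9;         /* comes out to ~79 GB/s vs the 112 GB/s peak */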
I expected that such a simple kernel would get much closer to the bandwidth limit. What am I missing?
P.S.: I get similar numbers from a CUDA implementation of the same algorithm.