2
votes

As part of an algorithm profiling running on GPU I feel that I'm hitting the memory bandwidth.

I have several complex kernels performing some complicated operations (sparse matrix multiplications, reduction etc) and some very simple ones and it seems that all (significant ones) hit ~79GB/s bandwidth wall when I calculate the total data read/written for each one of them, regardless the complexity of them, while the theoretical GPU bandwidth is 112GB/s (nVidia GTX 960)

The data set is very large operating on vectors of ~10,000,000 float entries so I get good measurements/statistics from clGetEventProfilingInfo between COMMAND_START and COMMAND_END. All the data remains in GPU memory during algorithm run so there virtually no host/device memory transfer (also it is not measured by profiling counters)

Even for a very simple kernel (see below) that solves x=x+alpha*b where x and b are huge vectors of ~10,000,000 entries, I don't get close to the theoretical bandwidth (112GB/s) but rather is running on ~70% of the maximum (~79GB/s)

__kernel void add_vectors(int N,__global float *x,__global float const *b,float factor)
{
    int gid = get_global_id(0);
    if(gid < N)
        x[gid]+=b[gid]*factor;
}

I calculate data transfer for this particular kernel per run as N * (2 + 1) * 4:

  • N - size of vector = ~10,000,000
  • 2 loads and 1 store per vector entry
  • 4 for sizeof float

I expected that for such a simple kernel I need to get much closer to the bandwidth limits, what do I miss?

P.S.: I get similar numbers from CUDA implementation of the same algorithm

1

1 Answers

2
votes

I think a more realistic way to evaluate if you have reached the peak bandwidth is to compare what you get with a simple D2D copy.

For example your kernel read x and b once and write x once, so the upper limit of the execution time should be 1.5x time of copying from b to x once. If you find the time is much higher than 1.5x, it means you probably have space to improve. In this kernel the work is so simple that the overhead (starting and ending the function, computing the index, etc.) may limit the performance. If this is an issue, you may find increasing the work per thread with a grid stride loop helps.

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

As for the theoretical bandwidth, at least you should consider the overhead of ECC if it is enabled.