1
votes

Summary:

I'm trying to write a memory bound OpenCL program that comes close to the advertised memory bandwidth on my GPU. In reality I'm off by a factor of ~50.

Setup:

I only have a relatively old Polaris Card (RX580), so I can't use CUDA and have to settle on OpenCL for now. I know this is suboptmial, and I can't get any debugging/performance counters to work, but it's all I have.

I'm new to GPU computing and want to get a feel for some of the performance that I can expect from GPU vs CPU. First thing to work on for me is memory bandwidth.

I wrote a very small OpenCL Kernel, which reads from strided memory locations in a way that I want all workers in the wavefront together to perform continuous memory access over a large memory segment, coalescing the accesses. All that the kernel then does with the loaded data is to sum the values up and write the sum back to another memory location at the very end. The code (which I shamelessly copied together from various sources for the most part) is quite simply

__kernel void ThroughputTestKernel(
                     __global float* vInMemory,
                     __global float* vOutMemory,
                     const int iNrOfIterations,
                     const int iNrOfWorkers
                   )
{
    const int gtid = get_global_id(0);
    
    __private float fAccumulator = 0.0;
    
    for (int k = 0; k < iNrOfIterations; k++) {
        fAccumulator += vInMemory[gtid + k * iNrOfWorkers];
    }
    
    vOutMemory[gtid] = fAccumulator;
}

I spawn iNrOfWorkers of these Kernels and measure the time it takes them to finish processing. For my tests I set iNrOfWorkers = 1024 and iNrOfIterations = 64*1024. From the processing time and the iMemorySize = iNrOfWorkers * iNrOfIterations * sizeof(float) I calculate a memory bandwidth of around 5GByte/s.

Expectations:

My problem is that memory accesses seem to be one to two orders of magnitude slower than the 256GByte/s that I was led to believe I have available.

The GCN ISA Manual [1] has me assuming that I have 36 CUs, each of which contains 4 SIMD units, each of which process vectors of 16 elements. Therefore I should have 36416 = 2304 processing elements available.

I spawn less than that amount, i.e. 1024, global work units ("threads"). The threads access memory locations in order, 1024 locations apart, so that in each iteration of the loop, the entire wavefront accesses 1024 consecutive elements. Therefore I believe that the GPU should be able to produce consecutive memory address accesses with no breaks in between.

My guess is that, instead of 1024, it only spawns very few threads, one per CU maybe? That way it would have to re-read the data over and over again. I don't know how I would be able to verify that, though.

[1] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf

1

1 Answers

0
votes

A few issues with your approach:

  • You don't saturate the GPU. To get peak performance, you need to launch much more threads than your GPU has execution units. Much more means >10000000.
  • Your loop contains index integer computation (for array-of-structures coalesced access). Here this is probably not enough to get you into the compute limit, but it's generally better to unroll the small loop with #pragma unroll; then the compiler does all the index calculation already. You can also bake the constants iNrOfIterations and iNrOfWorkers right into the OpenCL code with #define iNrOfIterations 16 / #define iNrOfWorkers 15728640 via C++ string concatenation or by hardcoding.

There is 4 different memory bandwidths based on your access pattern: coalesced/misaligned reads/writes. Coalesced is much faster than misaligned and the performance penalty for misaligned reads is less than misaligned writes. Only coalesced memory access gets you anywhere near the advertised bandwidth. You measure iNrOfIterations coalesced reads and 1 coalesced write. To measure all four types separately, you can use this:

#define def_N 15728640
#define def_M 16
kernel void benchmark_1(global float* data) {
    const uint n = get_global_id(0);
    #pragma unroll
    for(uint i=0; i<def_M; i++) data[i*def_N+n] = 0.0f; // M coalesced writes
}
kernel void benchmark_2(global float* data) {
    const uint n = get_global_id(0);
    float x = 0.0f;
    #pragma unroll
    for(uint i=0; i<def_M; i++) x += data[i*def_N+n]; // M coalesced reads
    data[n] = x; // 1 coalesced write (to prevent compiler optimization)
}
kernel void benchmark_3(global float* data) {
    const uint n = get_global_id(0);
    #pragma unroll
    for(uint i=0; i<def_M; i++) data[n*def_M+i] = 0.0f; // M misaligned writes
}
kernel void benchmark_4(global float* data) {
    const uint n = get_global_id(0);
    float x = 0.0f;
    #pragma unroll
    for(uint i=0; i<def_M; i++) x += data[n*def_M+i]; // M misaligned reads
    data[n] = x; // 1 coalesced write (to prevent compiler optimization)
}

Here the data array has the size N*M and each kernel is executed across the range N. For bandwidth calculation, execute each kernel a few hundred times (better average) and get the average execution times time1, time2, time3 and time4. The bandwidths are then computed like this:

  • coalesced read bandwidth (GB/s) = 4.0E-9f*M*N/(time2-time1/M)
  • coalesced write bandwidth (GB/s) = 4.0E-9f*M*N/( time1 )
  • misaligned read bandwidth (GB/s) = 4.0E-9f*M*N/(time4-time1/M)
  • misaligned write bandwidth (GB/s) = 4.0E-9f*M*N/(time3 )

For reference, here are a few bandwidth values measured with this benchmark.

Edit: How to measure kernel execution time:

  1. Clock
#include <thread>
class Clock {
private:
    typedef chrono::high_resolution_clock clock;
    chrono::time_point<clock> t;
public:
    Clock() { start(); }
    void start() { t = clock::now(); }
    double stop() const { return chrono::duration_cast<chrono::duration<double>>(clock::now()-t).count(); }
};
  1. Time measurement of K executions of a kernel
const int K = 128; // execute kernel 128 times and average execution time
NDRange range_local  = NDRange(256); // thread block size
NDRange range_global = NDRange(N); // N must be divisible by thread block size
Clock clock;
clock.start();
for(int k=0; k<K; k++) {
    queue.enqueueNDRangeKernel(kernel_1, NullRange, range_global, range_local);
    queue.finish();
}
const double time1 = clock.stop()/(double)K;