Summary:
I'm trying to write a memory bound OpenCL program that comes close to the advertised memory bandwidth on my GPU. In reality I'm off by a factor of ~50.
Setup:
I only have a relatively old Polaris Card (RX580), so I can't use CUDA and have to settle on OpenCL for now. I know this is suboptmial, and I can't get any debugging/performance counters to work, but it's all I have.
I'm new to GPU computing and want to get a feel for some of the performance that I can expect from GPU vs CPU. First thing to work on for me is memory bandwidth.
I wrote a very small OpenCL Kernel, which reads from strided memory locations in a way that I want all workers in the wavefront together to perform continuous memory access over a large memory segment, coalescing the accesses. All that the kernel then does with the loaded data is to sum the values up and write the sum back to another memory location at the very end. The code (which I shamelessly copied together from various sources for the most part) is quite simply
__kernel void ThroughputTestKernel(
__global float* vInMemory,
__global float* vOutMemory,
const int iNrOfIterations,
const int iNrOfWorkers
)
{
const int gtid = get_global_id(0);
__private float fAccumulator = 0.0;
for (int k = 0; k < iNrOfIterations; k++) {
fAccumulator += vInMemory[gtid + k * iNrOfWorkers];
}
vOutMemory[gtid] = fAccumulator;
}
I spawn iNrOfWorkers
of these Kernels and measure the time it takes them to finish processing. For my tests I set iNrOfWorkers = 1024
and iNrOfIterations = 64*1024
. From the processing time and the iMemorySize = iNrOfWorkers * iNrOfIterations * sizeof(float)
I calculate a memory bandwidth of around 5GByte/s.
Expectations:
My problem is that memory accesses seem to be one to two orders of magnitude slower than the 256GByte/s that I was led to believe I have available.
The GCN ISA Manual [1] has me assuming that I have 36 CUs, each of which contains 4 SIMD units, each of which process vectors of 16 elements. Therefore I should have 36416 = 2304 processing elements available.
I spawn less than that amount, i.e. 1024, global work units ("threads"). The threads access memory locations in order, 1024 locations apart, so that in each iteration of the loop, the entire wavefront accesses 1024 consecutive elements. Therefore I believe that the GPU should be able to produce consecutive memory address accesses with no breaks in between.
My guess is that, instead of 1024, it only spawns very few threads, one per CU maybe? That way it would have to re-read the data over and over again. I don't know how I would be able to verify that, though.
[1] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf