1 vote

I wrote a small OpenCL application which calculates the (element-wise) product of two matrices. I've noticed that if the size of the matrix exceeds 8192 x 8192 there is a significant performance drop (the calculation for 16384 x 16384 is ~80 times slower), and even the serial implementation is over 5 times faster. Here is the host code:

/*Make some includes and definitions here*/
#include "stdafx.h"

#define __CL_ENABLE_EXCEPTIONS // must come before cl.hpp to take effect
#include <CL/cl.hpp>

#include <vector>
#include <iostream>

#include "util.hpp" // utility library

#define ROWS (16384)    // rows of matrices X, Y, and S
#define COLUMNS (16384)

/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
#include "metrics.h"

/*Start main()*/

int main(void)
{
    int A = 0; // scalar kernel argument (not used in the kernel body)

    // Fill matrices X and Y with random float values

    float* h_x = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_x[j + i*COLUMNS] = rand() / (float)RAND_MAX;
        }
    }
    float* h_y = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_y[j + i*COLUMNS] = rand() / (float)RAND_MAX;
        }
    }
    float* h_s = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_s[j + i*COLUMNS] = 0.0;
        }
    }

    /*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/

    // Get all platforms (drivers)

    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);


    if (all_platforms.size() == 0){ // Check for issues
        std::cout << " No platforms found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Platform default_platform = all_platforms[0];
    std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";

    // Get default device of the default platform

    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);

    if (all_devices.size() == 0){ // Check for issues
        std::cout << " No devices found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";

    // Create an OpenCL context

    cl::Context context({ default_device });

    cl::Program program(context, util::loadProgram("saxy_kernel.cl")); // built explicitly below

    if (program.build({ default_device }) != CL_SUCCESS){
        std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
        getchar();
        exit(1);
    }

    // create buffers on the device
    cl::Buffer buffer_X(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_Y(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_S(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int));

    //create queue to which we will push commands for the device.
    cl::CommandQueue queue(context, default_device);

    //write matrices X and Y (and scalar A) to the device
    queue.enqueueWriteBuffer(buffer_X, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_x[0]);
    queue.enqueueWriteBuffer(buffer_Y, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_y[0]);
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int), &A);

    StartCounter();
    //run the kernel
    cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
    kernel_add.setArg(0, buffer_X);
    kernel_add.setArg(1, buffer_Y);
    kernel_add.setArg(2, buffer_S);
    kernel_add.setArg(3, buffer_A);

    cl::NDRange global(ROWS*COLUMNS);
    queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
    queue.finish();

    std::cout << "Kernel execution time: " << GetCounter() << "ms \n";

    //read result S from the device into array h_s
    queue.enqueueReadBuffer(buffer_S, CL_TRUE, 0, sizeof(float)*ROWS*COLUMNS, &h_s[0]);



    /*Print vectors
    std::cout << "\nMatrix #1: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_x[i] << "\t ";

    }

    std::cout << "\n\nMatrix #2: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_y[i] << "\t ";

    }

    std::cout << "\n\nResult: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_s[i] << "\t ";

    }*/
    getchar();
    return 0;
}

and here is the kernel:

__kernel void simple_add(
   __global float* X, 
   __global float* Y, 
   __global float* S, 
   __global int *A)
{
   // one element-wise multiplication per work-item
   S[get_global_id(0)] = X[get_global_id(0)] * Y[get_global_id(0)];
}

Could you please explain the reason? I know that I could achieve much better performance with some algorithmic optimizations, but I'm trying to figure out whether this is the threshold of the "naive" implementation, or whether I'm doing something wrong (an incorrect assignment of the work to groups).

EDIT: Since I was asked in the comments: the GPU I'm running the kernel on is an AMD R9 270 with 2GB of RAM. The CPU is an i7-4771 and the system has 8GB of RAM.

Whilst OpenCL is quite portable, it's not always "performance portable", so you probably should tell us what hardware you are running on. Also, doing ONE multiplication per kernel thread is likely quite slow on most architectures. I'd do something like 16 x 16, 4 x 32 or 256 x 256 elements per thread. What size is "right" will depend on your hardware, but it's PROBABLY better to have slightly fewer threads and more work in each thread. Check whether the documentation for your OpenCL implementation says something on the subject of "tuning your kernels", etc. - Mats Petersson
So if you have 2GB of dedicated graphics memory, storing 3 x sizeof(float) x 16384 x 16384 will not happen, as that takes up 3GB. That could explain why it's very slow - I'm not sure what the AMD/ATI OpenCL driver does in such a situation, but one way or another it will have to store at least SOME of your allocations in main memory, and either use GPU paging to swap it back and forth in lumps, or use some kind of bus to access the data in CPU-memory space. Either of those is going to make it quite slow, I expect [I don't have ANY knowledge of what ATI/AMD are doing in their products]. - Mats Petersson
@MatsPetersson Thank you for your suggestions. I'm curious to find out whether this is happening due to memory issues or because of the few calculations per thread. But what should I do in order to change the number of calculations per thread? - Arkoudinos
Your host code is also timing the data transfer back from the device, which is probably dominating the runtime you are measuring. For benchmarking this sort of thing, I would usually run the kernel many times in a loop and take the average, which gives the device a chance to warm up (see the sketch after these comments). - jprice
Yes, I've already seen and corrected that - I should have corrected it here as well. The whole deal was a few milliseconds faster for 8192 x 8192 (~50ms kernel execution time) and about 0.5 sec faster for the larger 16384 x 16384 (about 7 seconds kernel execution time). Now, about your suggestion: I'm going to test it and see what happens. - Arkoudinos
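
A minimal sketch of the warm-up-and-average measurement jprice suggests, reusing the queue, kernel_add and global objects and the StartCounter()/GetCounter() helpers from the question; the single warm-up run and the iteration count of 10 are arbitrary choices:

// Warm-up launch so one-off driver costs don't skew the numbers
queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
queue.finish();

const int iterations = 10; // arbitrary averaging count
StartCounter();
for (int run = 0; run < iterations; ++run) {
    queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
}
queue.finish(); // wait for all launches before stopping the timer
std::cout << "Average kernel time: " << GetCounter() / iterations << "ms \n";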

1 Answer

2 votes

Writing an answer about "how to do more calculations per thread" because code-formatting is non-existent in comments, and also covering a little on memory usage...

So, most OpenCL implementations will need to run more than a couple of instructions per thread (and the right number of threads) for efficient performance. But like I said in comments, this is HIGHLY dependent on the actual architecture of the processing unit (GPU, CPU, or OpenCL-capable magical unit weaved from unicorn hair, whatever it may be) - each manufacturer of GPUs, CPUs and unicorn weavers has their own ideas of how to make a very efficient unit, and they all tend to change their minds as time flows too... ;)

To do a little more work in one thread you could simply do:

#define NUM_PER_THREAD 16
__kernel void simple_add(
 __global float* X, 
 __global float* Y, 
 __global float* S, 
 __global int *A)
{
   // each work-item now handles NUM_PER_THREAD consecutive elements
   for (int i = 0; i < NUM_PER_THREAD; i++)
   {
      size_t index = get_global_id(0)*NUM_PER_THREAD + i;
      S[index] = X[index] * Y[index];
   }
}
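
Note that the host then has to launch correspondingly fewer work-items, otherwise the threads run past the end of the buffers. Assuming ROWS*COLUMNS divides evenly by NUM_PER_THREAD (it does for these power-of-two sizes), the change is one line:

cl::NDRange global(ROWS*COLUMNS / 16); // 16 == NUM_PER_THREAD in the kernel above
queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);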

[This will do 1 x 16 blocks. It gets a bit more fun to try to do 16 x 16 or something like that, but it can be done if you know the size (width) of the matrix - see the sketch below.]
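
For illustration, here is a sketch of what such a 16 x 16 variant could look like; the kernel name simple_add_2d and the width argument are my additions, and the host would launch it with a two-dimensional range, e.g. cl::NDRange global(COLUMNS/16, ROWS/16), setting the width with kernel.setArg(3, COLUMNS):

#define BLOCK 16
__kernel void simple_add_2d(
 __global float* X, 
 __global float* Y, 
 __global float* S, 
 const int width)   // width (COLUMNS) of the matrix, set from the host
{
   // each work-item owns a BLOCK x BLOCK tile of the matrix
   size_t col0 = get_global_id(0) * BLOCK;
   size_t row0 = get_global_id(1) * BLOCK;
   for (int r = 0; r < BLOCK; r++)
   {
      for (int c = 0; c < BLOCK; c++)
      {
         size_t index = (row0 + r) * width + (col0 + c);
         S[index] = X[index] * Y[index];
      }
   }
}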

Regarding memory: GPUs that have dedicated local memory (in other words, most graphics cards) will work MUCH faster if all the data fits in the graphics memory. Accessing "main" memory involves one of two approaches (a sketch of how to check up front whether your data fits follows the list):

  1. Long access times for each cache-line when the GPU reads over the PCI-express bus [or whatever infrastructure is used] - this can be 100x or 1000x slower than "local" memory. The GPU also (most likely) has to ask the CPU whether the memory content is in cache, and if so, wait further for the CPU to copy the data out to main memory...
  2. "Page in/out", where the GPU stops, sends an interrupt to the CPU, the CPU finds some suitable lump [lump in this context being the technical term for "some amount of memory, most likely around 4K or a multiple thereof"] of memory to "remove" from the GPU memory and copies it out to main memory, then copies the required other lump of memory into the GPU memory - similar to when the OS swaps memory to/from the hard disk. And if you are unlucky, the GPU also has to do some interesting cache or TLB flushing to ensure that the correct data is used.

Note that I still (in the last hour or so) haven't got any particular insight into how AMD/ATI GPUs work, or how their OpenCL driver works. The above is a mixture of guessing/knowing how GPUs work in general, understanding of how OpenCL works in general, and calculating the memory needed to store the three different arrays of 16K x 16K float values.