Trying to create a simply copy/paste values in OpenCL 1.1 using CL_MEM_USE_HOST_PTR, why does it not work?

Question

I am quite new to the embedded and OpenCL, I am currently trying to develop a sample code to execute to an i.MX6q board which supports OpenCL 1.1 EP.

I had to start from scratch so I followed these tutorials, the OpenCL 1.1 Reference pages and also this OpenCL example to make my first OpenCL implementation/application.

Basically what I would like to do is to develop a "performance test" to run on the board. It consists on having two int arrays (input and output), filling the first one with random values and pasting it into the output array using OpenCL work items.

I was quite confused between clEnqueue(Read/Write)Buffer functions and the clCreateBuffer flags (especially CL_MEM_USE_HOST_PTR) so I decided to have a look and to practice with it.

My code compiles correctly and runs correctly however when I'm reading the output array values, they still stay at 0.

Here is my code (which is C++) :

void    buffer_copy(char* kernelfile)
{
    cl_platform_id      platform_id;
    cl_device_id        device_id;
    cl_context          context;
    cl_command_queue    cmd_queue;
    cl_program          program;

    //  Retrieving all the OpenCL data needed
    //  to start the performance test
    platform_id = get_platform();
    device_id = get_device(platform_id);
    context = get_context(platform_id, device_id);
    cmd_queue = get_command_queue(context, device_id);
    program = get_program(context, kernelfile);

    cl_mem      buffer_input, buffer_output;
    size_t      buffer_width = 640, buffer_height = 480;
    size_t      buffer_size = buffer_width * buffer_height;
    cl_kernel   kernel;
    cl_int      err = 0;
    char*       options = "-Werror -cl-std=CL1.1";

    int         data_input[buffer_size];
    int         data_output[buffer_size];

    //  Assigning random values in the data_input array and
    //  initializing the data_output array to zero-values
    srand(time(NULL));
    for (size_t index = 0; index < buffer_size; ++index)
    {
        data_input[index] = rand();
        data_output[index] = 0;
    }

    //  Creating OpenCL buffers
    buffer_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, buffer_size * sizeof(int), data_input, &err);
    assert(err == CL_SUCCESS);
    buffer_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, buffer_size * sizeof(int), data_output, &err);
    assert(err == CL_SUCCESS);

    err = clBuildProgram(program, 1, &device_id, options, NULL, NULL);
    assert(err == CL_SUCCESS);
    kernel = clCreateKernel(program, "buffer_copy", &err);
    assert(err == CL_SUCCESS);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_output);

    size_t  device_max_work_group_size;
    size_t  global_work_size, local_work_size;
    size_t  preferred_work_group_size_multiple;

    cl_ulong    global_mem_size, max_mem_alloc_size;
    clGetDeviceInfo(device_id, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &global_mem_size, NULL);
    clGetDeviceInfo(device_id, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), &max_mem_alloc_size, NULL);
    clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &device_max_work_group_size, NULL);
    std::cout << "Global device memory size: " << global_mem_size << " bytes" << std::endl;
    std::cout << "Device max memory allocation size: " << max_mem_alloc_size << " bytes" << std::endl;
    std::cout << "Device max work group size: " << device_max_work_group_size << std::endl;

    clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &global_work_size, NULL);
    std::cout << "global_work_size value: " << global_work_size << std::endl;

    clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_work_group_size_multiple, NULL);
    local_work_size = global_work_size / preferred_work_group_size_multiple;
    std::cout << "local_work_size value: " << local_work_size << std::endl;

    cl_event events[2];
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size, &local_work_size, 0, 0, &events[0]);
    assert (err == CL_SUCCESS);
    err = clEnqueueReadBuffer(cmd_queue, buffer_output, CL_TRUE, 0, buffer_size * sizeof(int), data_output, 0, NULL, &events[1]);
    assert (err == CL_SUCCESS);
    err = clWaitForEvents(2, events);
    assert (err == CL_SUCCESS);

    for (size_t index = 0; index < buffer_size; ++index)
    {
        if (data_input[index] != data_output[index])
        {
            std::cerr << "Error, values differ (at index " << index << ")." << std::endl;
            break;
        }
        else
        {
            //std::cout << "data_input[index] =\t" << data_input[index] << std::endl;
            //std::cout << "data_output[index] =\t" << data_output[index] << std::endl;
        }
    }

    cl_ulong    time_start, time_end;
    double      total_time;
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(events[1], CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    total_time = time_end - time_start;
    std::cout << "Execution time in milliseconds: " << (total_time / 1000000.0) << " ms" << std::endl;

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(buffer_input);
    clReleaseMemObject(buffer_output);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
}

And here is my OpenCL kernel :

__kernel void   buffer_copy(__global int* input, __global int* output)
{
    int id = get_global_id(0);

    output[id] = input[id];
}

Right now I'm just trying to make it work, not optimizing it. And I think I am missing good points here and there but I can't catch them. In my opinion, I am confusing the clCreateBuffer flags.

Could you guys enlighten me and help me on this one ?

EDIT: updated code + new infos !

It seems that values are well pasted, but only according to the kernel work group size: the CL_DEVICE_MAX_WORK_GROUP_SIZE returns 1024 and the CL_KERNEL_WORK_GROUP_SIZE also returns 1024 (which is strange also). So the first 1024 integers of my array are well copied/pasted but after that it does not work anymore. To verify this, I set the global_work_group_size manually to 32, ran my program again and then the only first 32 integers are pasted correctly. I really don't understand what is going on here.

Major Squirrel Major Squirrel · Accepted Answer · 2017-09-24T15:00:53

I think I was able to make it work for both my laptop and the i.MX6q board.

Here is the code that works :

void    buffer_copy(char* kernelfile)
{
    cl_platform_id      platform_id;
    cl_device_id        device_id;
    cl_context          context;
    cl_command_queue    cmd_queue;
    cl_program          program;

    //  Retrieving all the OpenCL data needed
    //  to start the performance test
    platform_id = get_platform();
    device_id = get_device(platform_id);
    context = get_context(platform_id, device_id);
    cmd_queue = get_command_queue(context, device_id);
    program = get_program(context, kernelfile);

    cl_mem      buffer_input, buffer_output;
    size_t      buffer_width = 640, buffer_height = 480;
    size_t      buffer_size = buffer_width * buffer_height;
    cl_kernel   kernel;
    cl_int      err = 0;
    char*       options = "-Werror -cl-std=CL1.1";

    int         data_input[buffer_size];
    int         data_output[buffer_size];

    //  Assigning random values in the data_input array and
    //  initializing the data_output array to zero-values
    srand(time(NULL));
    for (size_t index = 0; index < buffer_size; ++index)
    {
        data_input[index] = rand();
        data_output[index] = 0;
    }

    //  Creating OpenCL buffers
    buffer_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, buffer_size * sizeof(int), data_input, &err);
    assert(err == CL_SUCCESS);
    buffer_output = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, buffer_size * sizeof(int), data_output, &err);
    assert(err == CL_SUCCESS);

    err = clBuildProgram(program, 1, &device_id, options, NULL, NULL);
    assert(err == CL_SUCCESS);
    kernel = clCreateKernel(program, "buffer_copy", &err);
    assert(err == CL_SUCCESS);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer_input);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buffer_output);

    cl_ulong    global_mem_size = 0, max_mem_alloc_size = 0;
    size_t      device_max_work_group_size = 0;
    size_t      kernel_work_group_size = 0;
    size_t      preferred_work_group_size_multiple = 0;
    clGetDeviceInfo(device_id, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &global_mem_size, NULL);
    clGetDeviceInfo(device_id, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(cl_ulong), &max_mem_alloc_size, NULL);
    clGetDeviceInfo(device_id, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &device_max_work_group_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_work_group_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &preferred_work_group_size_multiple, NULL);
    std::cout << "CL_DEVICE_GLOBAL_MEM_SIZE : " << global_mem_size << " bytes" << std::endl;
    std::cout << "CL_DEVICE_MAX_MEM_ALLOC_SIZE : " << max_mem_alloc_size << " bytes" << std::endl;
    std::cout << "CL_DEVICE_MAX_WORK_GROUP_SIZE : " << device_max_work_group_size << std::endl;
    std::cout << "CL_KERNEL_WORK_GROUP_SIZE : " << kernel_work_group_size << std::endl;
    std::cout << "CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE : " << preferred_work_group_size_multiple << std::endl;

    cl_event    events[2];
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &buffer_size, &kernel_work_group_size, 0, NULL, &events[0]);
    assert (err == CL_SUCCESS);
    err = clEnqueueReadBuffer(cmd_queue, buffer_output, CL_TRUE, 0, buffer_size * sizeof(int), data_output, 1, &events[0], &events[1]);
    assert (err == CL_SUCCESS);
    err = clWaitForEvents(2, events);
    assert (err == CL_SUCCESS);

    for (size_t index = 0; index < buffer_size; ++index)
    {
        if (data_input[index] != data_output[index])
        {
            std::cerr << "Error, values differ (at index " << index << ")." << std::endl;
            break;
        }
    }

    cl_ulong    time_start, time_end;
    double      total_time;

    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    total_time = time_end - time_start;
    std::cout << "clEnqueueNDRangeKernel execution time in milliseconds: " << (total_time / 1000000.0) << " ms" << std::endl;
    clGetEventProfilingInfo(events[1], CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(events[1], CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    total_time = time_end - time_start;
    std::cout << "clEnqueueReadBuffer execution time in milliseconds: " << (total_time / 1000000.0) << " ms" << std::endl;

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(buffer_input);
    clReleaseMemObject(buffer_output);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
}

So as you can see, I am just copying 640*480 (307200) integers from one array to another one using OpenCL 1.1 EP.

I allocated both memory buffers from host side and told OpenCL to use them through the host pointers (meaning no memcpy if I am right).

Here is the output from my laptop (working on GeForce GTX 765m) :

CL_DEVICE_GLOBAL_MEM_SIZE : 2094923776 bytes
CL_DEVICE_MAX_MEM_ALLOC_SIZE : 523730944 bytes
CL_DEVICE_MAX_WORK_GROUP_SIZE : 1024
CL_KERNEL_WORK_GROUP_SIZE : 1024
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE : 32

clEnqueueNDRangeKernel execution time in milliseconds: 0.061856 ms
clEnqueueReadBuffer execution time in milliseconds: 0.100544 ms

Here is the output from the i.MX6q SoM (working on Vivante GC2000 GPU) :

CL_DEVICE_GLOBAL_MEM_SIZE : 67108864 bytes
CL_DEVICE_MAX_MEM_ALLOC_SIZE : 33554432 bytes
CL_DEVICE_MAX_WORK_GROUP_SIZE : 1024
CL_KERNEL_WORK_GROUP_SIZE : 176
CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE : 16

clEnqueueNDRangeKernel execution time in milliseconds: 4.463 ms
clEnqueueReadBuffer execution time in milliseconds: 7.199 ms

What was wrong ?
I think that I was giving wrong global_work_size and local_work_size values to the clEnqueueNDRangeKernel function. Yet, I still really don't understand how do they work and how to calculate them. I still don't understand the difference between those values and CL_KERNEL_WORK_GROUP_SIZE and how is the kernel work group size calculated by OpenCL compiler. Why is the CL_KERNEL_WORK_GROUP_SIZE different between the SoM and my laptop ? I use the same kernel though.

Any optimizations to recommend ?
If you have any optimizations to recommend to me, I would be grateful ! All of this context is to learn how to do some image processing and developing the algorithms to make them work with OpenCL (since I can't use OpenCV on this SoM).

Trying to create a simply copy/paste values in OpenCL 1.1 using CL_MEM_USE_HOST_PTR, why does it not work?

1 Answers