
In my current OpenCL implementation, I wanted to save time with arguments, avoid passing them every time I wanted to use a buffer inside a kernel, and have a shorter argument list for my kernels.

So I made a structure (workspace) that holds pointers to the buffers in device memory. The struct acts like an object with member variables that you want to access over time and keep alive for the whole execution. I never had a problem on AMD GPUs or even on CPUs, but Nvidia causes a lot of problems with this. It always seems to be an alignment problem, reads from the wrong buffer, etc.

Here is some code to help; see the question below.

The structure defined on the host:

 #define SRC_IMG 0       // (float4 buffer) Source image
 #define LAB_IMG 1       // (float4 buffer) LAB image

 // NOTE: The size of this array must be the last define + 1.
 #define __WRKSPC_SIZE__ 2 

 // Structure defined on host.
 struct Workspace
 {
      cl_ulong getPtr[__WRKSPC_SIZE__];
 };

 struct HostWorkspace
 {
      cl::Buffer srcImg;
      cl::Buffer labImg;
 };

The structure defined on device:

typedef struct __attribute__(( packed )) gpuWorkspace
{
    ulong getPtr[__WRKSPC_SIZE__]; 
} gpuWorkspace_t; 

Note that on the device I use ulong and on the host I use cl_ulong, as shown here: OpenCL: using struct as kernel argument.

So once the cl::Buffer for the source image or the LAB image is created, I save it into a HostWorkspace object. Until that object is released, the reference to the cl::Buffer is kept, so the buffer exists for the entire run on the host, and de facto on the device.

Now, I need to feed those to the device, so I have a simple kernel which initializes my device workspace as follows:

__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
                             __global float4* src,
                             __global float4* LAB)
{
    // Store the address of the first element of each buffer as a ulong.
    wrkspc->getPtr[SRC_IMG] = (ulong)&src[0];
    wrkspc->getPtr[LAB_IMG] = (ulong)&LAB[0];
}

where wrkspc is a buffer allocated with struct Workspace, and src and LAB are just buffers allocated as 1D array images.

Afterwards, in any of my kernels, if I want to use src or LAB, I do as follows:

__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
    // =============================================================
    // Get pointer from work space.
    // =============================================================

    // Cast the stored value back to a pointer and use it as a normal
    // buffer for the duration of the kernel.
    __global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
    __global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );

    // Code kernel as usual.
}

When I started to use this, I had 4-5 images, which were working well with a different structure like this:

struct Workspace
{
    cl_ulong imgPtr;
    cl_ulong labPtr;
};

where each image had its own pointer.

At a certain point I reached more images, and I ran into problems. So I searched online, and I found a recommendation that sizeof() the struct could differ between device and host, so I changed it to a single array of one type, and this worked fine until 16 elements.

So I searched more, and I found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now, at 26 elements, when I check the sizeof the struct either on the device or on the host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8). Yet I still have an issue similar to my previous model: my pointer ends up reading somewhere in the middle of the previous image, etc.

So I was wondering if anyone has ever tried a similar model (maybe with a different approach) or has any tips for making this model "solid".

Note that all kernels are well coded; I get good results when executing on AMD or on CPU with the same code. The only issue is on Nvidia.

Which language, C or C++? They are distinct languages. C++ has std::vector and std::array as well as pass by reference. The C language doesn't have these items. – Thomas Matthews

OpenCL is C, so all my device code is in C. But the host code (which I didn't show much of) uses C++. – Vuwox

1 Answer


Don't try to store GPU-side pointer values across kernel boundaries. They are not guaranteed to stay the same. Always use indices. And if a kernel uses a specific buffer, you need to pass it as an argument to that kernel.

References:

  1. The OpenCL 1.2 specification (as far as I'm aware, nvidia does not implement a newer standard) does not define the behaviour of pointer-to-integer casts or vice versa.
  2. Section 6.9p states: "Arguments to kernel functions that are declared to be a struct or union do not allow OpenCL objects to be passed as elements of the struct or union." This is exactly what you are attempting to do: passing a struct of buffers to a kernel.
  3. Section 6.9a states: "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)." - This is essentially what you're trying to subvert by casting your pointers to an integer and back. (point 1) You can't "trick" OpenCL into this being well-defined by bypassing the type system.

As I suggest in the comment thread below, you will need to use indices to save positions inside a buffer object. If you want to store positions across different memory regions, you'll need to either unify multiple buffers into one and save one index into this giant buffer, or save a numeric value that identifies which buffer you are referring to.