In my current OpenCL implementation, I wanted to save time with arguments, avoid passing them every time I wanted to use a buffer inside a kernel, and have a shorter argument list for my kernels.

So I made a structure (a workspace) that holds pointers to the buffers in device memory; the struct acts like an object with member variables that you want to access over time and that stay alive for the whole execution. I never had a problem on AMD GPUs or even on CPUs, but Nvidia causes a lot of problems with this: it always seems to be an alignment problem, never reaching the right buffer, etc.
Here is some code to help; see the question below.

The structure defined on the host:
#define SRC_IMG 0 // (float4 buffer) Source image
#define LAB_IMG 1 // (float4 buffer) LAB image

// NOTE: The size of this array should be the last define + 1.
#define __WRKSPC_SIZE__ 2

// Structure defined on the host.
struct Workspace
{
    cl_ulong getPtr[__WRKSPC_SIZE__];
};

struct HostWorkspace
{
    cl::Buffer srcImg;
    cl::Buffer labImg;
};
The structure defined on the device:
typedef struct __attribute__(( packed )) gpuWorkspace
{
    ulong getPtr[__WRKSPC_SIZE__];
} gpuWorkspace_t;
Note that on the device I use ulong and on the host I use cl_ulong, as shown here: OpenCL: using struct as kernel argument.
So once the cl::Buffer for the source image or the LAB image is created, I save it into a HostWorkspace object; until that object is released, the reference to the cl::Buffer is kept, so the buffer exists for the whole execution on the host, and de facto on the device.
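For context, the buffer creation looks roughly like this (a minimal sketch, not my exact code; context, numPixels, and the helper name are placeholders):

#include <CL/cl.hpp>

// Create the device buffers once and keep the cl::Buffer references
// alive in a HostWorkspace object for the whole execution.
// 'context' and 'numPixels' are placeholders for the real values.
HostWorkspace CreateHostWorkspace(cl::Context& context, size_t numPixels)
{
    HostWorkspace ws;
    // One float4 per pixel for each 1D image buffer.
    ws.srcImg = cl::Buffer(context, CL_MEM_READ_WRITE, numPixels * sizeof(cl_float4));
    ws.labImg = cl::Buffer(context, CL_MEM_READ_WRITE, numPixels * sizeof(cl_float4));
    return ws;
}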
Now I need to feed those to the device, so I have a simple kernel which initializes my device workspace as follows:
__kernel void Workspace_Init(__global gpuWorkspace_t* wrkspc,
                             __global float4* src,
                             __global float4* LAB)
{
    // Store the address of the first element of each buffer as a ulong.
    wrkspc->getPtr[SRC_IMG] = (ulong)&src[0];
    wrkspc->getPtr[LAB_IMG] = (ulong)&LAB[0];
}
where wrkspc is a buffer allocated with sizeof(struct Workspace), and src and LAB are just buffers allocated as 1D arrays of image data.
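The dispatch looks roughly like this on the host (again a sketch; program, queue, wrkspcBuf, and ws are placeholders for my real objects):

// 'wrkspcBuf' is a cl::Buffer of sizeof(Workspace) bytes; 'ws' is the
// HostWorkspace holding the image buffers.
cl::Kernel initKernel(program, "Workspace_Init");
initKernel.setArg(0, wrkspcBuf);
initKernel.setArg(1, ws.srcImg);
initKernel.setArg(2, ws.labImg);
// A single work-item is enough: the kernel only stores two addresses.
queue.enqueueNDRangeKernel(initKernel, cl::NullRange, cl::NDRange(1), cl::NullRange);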
And afterwards, in any of my kernels, if I want to use src or LAB, I do the following:
__kernel void ComputeLABFromSrc(__global gpuWorkspace_t* wrkSpc)
{
    // =============================================================
    // Get pointers from the workspace.
    // =============================================================
    // Cast the stored address back to a normal buffer pointer that
    // you can use throughout the execution of the kernel.
    __global float4* srcData = ( __global float4* )( wrkSpc->getPtr[SRC_IMG] );
    __global float4* labData = ( __global float4* )( wrkSpc->getPtr[LAB_IMG] );

    // Code the kernel as usual.
}
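This is the payoff on the host side: later kernels only take the workspace argument (same placeholder names as above):

// Only one argument to set, no matter how many buffers the kernel uses.
cl::Kernel labKernel(program, "ComputeLABFromSrc");
labKernel.setArg(0, wrkspcBuf);
queue.enqueueNDRangeKernel(labKernel, cl::NullRange, cl::NDRange(numPixels), cl::NullRange);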
When I started to use this, I had around 4-5 images and it was going well, with a different structure like this:
struct Workspace
{
    cl_ulong imgPtr;
    cl_ulong labPtr;
};
where each image had its own pointer.

At a certain point I reached more images, and I started to have problems. So I searched online, and I found a recommendation saying that the sizeof() of the struct could differ between device and host, so I changed it to a single array of the same type, and this worked fine up to 16 elements.

So I searched more, and I found a recommendation about __attribute__((packed)), which I put on the device structure (see above). But now, at 26 elements, when I check the sizeof the struct either on the device or on the host, the size is 208 (elements * sizeof(cl_ulong) == 26 * 8), yet I still have an issue similar to the previous model: my pointer reads somewhere else, in the middle of the previous image, etc.
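For reference, this is how the size can be checked on both sides (a sketch; the device-side kernel Workspace_SizeOf is only for illustration, not part of my real code):

// Host side: catch unexpected padding at compile time.
static_assert(sizeof(Workspace) == __WRKSPC_SIZE__ * sizeof(cl_ulong),
              "Host Workspace struct has unexpected padding");

// Device side: report sizeof(gpuWorkspace_t) back to the host for comparison.
__kernel void Workspace_SizeOf(__global ulong* out)
{
    out[0] = sizeof(gpuWorkspace_t);
}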
So I am wondering if anyone has ever tried a similar model (maybe with a different approach), or has any tips to make this model "solid".
Note that all the kernels are well coded; I get good results when executing on AMD or on a CPU with the same code. The only issue is on Nvidia.