
I would like to know how I can execute two or more different kernels in parallel, at the same time, on the same GPU using OpenCL. My main idea is to use two different kernels (kernel A and kernel B), but they need to use the same memory (I do not want to duplicate the memory by using a separate buffer for each kernel's "a" and "b" pointers). So is there another way I can accomplish the dual execution with an efficient memory technique? The code of the kernels is the following:

Kernel A:

__kernel void kernelA(__global struct VectorStruct* a, int aLen0,
                      __global struct VectorStruct* b, int bLen0,
                      __global struct VectorStruct* c, int cLen0)
{
    int i = get_local_id(0);
    c[i].x = a[i].x + b[i].x;
}

Kernel B:

__kernel void kernelB(__global struct VectorStruct* a, int aLen0,
                      __global struct VectorStruct* b, int bLen0,
                      __global struct VectorStruct* d, int dLen0)
{
    int i = get_local_id(0);
    d[i].y = a[i].y + b[i].y;
}

The definition for the struct VectorStruct is the following:

struct VectorStruct { int x; int y; };

In the host code I have to create four pointers:

VectorStruct* a
VectorStruct* b
VectorStruct* c
VectorStruct* d

The pointers "a" and "b" hold the data that I will transfer to the GPU. The pointer "c" will store the results of kernel A, and the pointer "d" will store the results of kernel B.
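To avoid duplicating the input data, the host can create a single cl_mem object per input array and pass the same buffer handles to both kernels. A rough sketch of that setup, assuming a context, a queue, and an element count n already exist (the variable names here are illustrative, not from the original post):

// Sketch (untested): one buffer per host array, shared by both kernels.
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(struct VectorStruct), a, &err);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             n * sizeof(struct VectorStruct), b, &err);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                             n * sizeof(struct VectorStruct), NULL, &err);
cl_mem bufD = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                             n * sizeof(struct VectorStruct), NULL, &err);

// Both kernels receive the *same* bufA and bufB — no data is duplicated.
clSetKernelArg(kernelA, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernelA, 2, sizeof(cl_mem), &bufB);
clSetKernelArg(kernelA, 4, sizeof(cl_mem), &bufC);
clSetKernelArg(kernelB, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernelB, 2, sizeof(cl_mem), &bufB);
clSetKernelArg(kernelB, 4, sizeof(cl_mem), &bufD);

Since kernel A and kernel B only read from a and b, there is no write conflict in sharing those buffers between the two kernels.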


1 Answer


You can enqueue your two kernels with clEnqueueNDRangeKernel() on a concurrent command queue, i.e. one created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE passed to clCreateCommandQueue(). Then pass the two resulting event objects to the buffer read or map calls that read the results back on the host. Note that not all hardware and OpenCL implementations support concurrent execution of different kernels, so the kernels may end up being serialised to some extent after all.
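A sketch of that host-side flow, assuming the context, device, kernels, buffers, and element count n from the question already exist (and that the implementation supports out-of-order queues):

// Sketch (untested): out-of-order queue, two kernels, event-based reads.
cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

size_t globalSize = n;
cl_event evA, evB;
clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &globalSize, NULL, 0, NULL, &evA);
clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &globalSize, NULL, 0, NULL, &evB);

// Each read waits only on the kernel that produced its buffer, so the
// runtime is free to overlap the two kernels if the hardware allows it.
clEnqueueReadBuffer(queue, bufC, CL_FALSE, 0, n * sizeof(struct VectorStruct),
                    c, 1, &evA, NULL);
clEnqueueReadBuffer(queue, bufD, CL_FALSE, 0, n * sizeof(struct VectorStruct),
                    d, 1, &evB, NULL);
clFinish(queue);

On an out-of-order queue nothing is implicitly ordered, which is why the event wait lists on the read calls are required for correctness.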

You can also achieve something similar with multiple in-order command queues on the same device.

For kernels as simple as yours, it may be better to represent your vector with a vector type and perform a vectorised (SIMD) addition in a single kernel. The OpenCL compiler should pick up on the vector operations and distribute them across the parallel hardware automatically.
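A sketch of that single combined kernel. Since VectorStruct holds ints, int2 is the matching built-in vector type here (float2 would be the equivalent for float data); the kernel name and the single output buffer are illustrative choices, not from the original post:

// Sketch: one kernel, one vectorised addition per work-item.
// c[i] receives both the x (.s0) and y (.s1) sums in a single operation,
// replacing the separate kernelA/kernelB outputs.
__kernel void addVec(__global const int2* a,
                     __global const int2* b,
                     __global int2* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];   // component-wise add of x and y at once
}

Because each work-item loads and stores one contiguous int2, adjacent work-items access adjacent memory, which coalesces well on GPU hardware.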

For slightly more complicated operations where this doesn't work so well, you could represent the vector's x and y coordinates as a 2-element array, and simply enqueue twice the number of work-items on one kernel that works on alternating dimensions.
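One way to sketch that alternating-dimension approach, with a hypothetical array-based layout replacing the original struct (enqueue 2*n work-items for n vectors):

// Sketch: v[0] plays the role of x, v[1] the role of y.
struct VectorStruct2 { int v[2]; };   // hypothetical replacement layout

__kernel void addDim(__global const struct VectorStruct2* a,
                     __global const struct VectorStruct2* b,
                     __global struct VectorStruct2* c)
{
    int gid = get_global_id(0);
    int i = gid / 2;   // which vector this work-item handles
    int d = gid % 2;   // which dimension: 0 = x, 1 = y
    c[i].v[d] = a[i].v[d] + b[i].v[d];
}

Consecutive work-items touch consecutive ints in memory, so this layout also keeps the accesses coalesced.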

Both approaches will give you much more efficient memory access patterns.

Note that your use of get_local_id(0) might be erroneous, depending on what you want to achieve - you probably want to be using get_global_id(0) here, since get_local_id(0) only indexes within a single work-group.