
I am executing Monte Carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program is suddenly eating up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, and each sweep consists of 2 kernel invocations. That results in about 8 million kernel invocations.

The source of the memory usage was easy to find (see screenshot).

  • While the kernel executions are being enqueued, the memory usage goes up.
  • While the kernels are executing the memory usage stays constant.
  • Once the kernels finish, the usage goes back down to its original level.
  • I did not allocate any memory, as can be seen in the memory snapshots.

[Screenshot: host memory usage over the course of the run]

That means the OpenCL driver is using the memory. I understand that it must keep a copy of the arguments for every kernel invocation, along with the global and local work-group sizes, but that alone does not add up to this much.

The peak memory usage was 4.5GB. Before enqueuing the kernels, about 250MB were used. That means OpenCL used about 4.25GB for the 8 million invocations, i.e. about half a kilobyte per invocation.

So my questions are:

  • Is that kind of memory usage normal and to be expected?
  • Are there good/known techniques to reduce memory usage?
  • Maybe I should not enqueue so many kernels at once, but how would I do that without introducing a full synchronization, e.g. with clFinish()?
Try clFlush after every 100 or 1000 kernel invocations, or even more rarely if you observe that it hurts performance. It is a non-blocking command which will issue all previously queued commands to the device. — doqtor
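For illustration, a periodic flush might look roughly like this (a sketch only: queue, kernel, and gws are assumed to be created elsewhere, and error checking is omitted):

/* Flush every 1000 enqueues: non-blocking, it hands the queued commands
   to the driver so it can begin submitting them to the device. */
for (size_t i = 0; i < 8000000; ++i) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    if (i % 1000 == 999)
        clFlush(queue);
}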
Thanks for the hint! I added a clFlush after every 1000 kernel invocations and that already improved it a bit. Peak memory usage went down to 3.5GB and the peak is much shorter, i.e. the downward slope begins much earlier but is not as steep. The impact on performance is either small or non-existent; I can't see any. — Gigo
OK, try clWaitForEvents. Enqueue 1000 kernels, adding an event on the last one, then enqueue another 999 kernels and call clWaitForEvents. This will make it wait for the 1000th kernel to finish while another 999 kernels are already in the queue. Then repeat the whole thing in the same way. — doqtor
Great, that brought it down to 750MB. Does a clFlush before the clWaitForEvents make any sense? I did not notice any difference. — Gigo
A clFlush just before clWaitForEvents, rather not, because I think it's done by clWaitForEvents anyway. If you want to get memory down below 750MB, I'd try decreasing the number of kernels after which you wait for them to finish, for example to 500. This number needs to be worked out so that it is optimal for your scenario. I'm going to put what we did here as an answer; accept it if you are happy. — doqtor

1 Answer


Enqueueing a large number of kernel invocations needs to be done in a somewhat controlled manner so that the command queue does not eat up too much memory. clFlush helps to some degree, but clWaitForEvents is necessary to create a synchronization point in the middle: for example, 2000 kernel invocations are enqueued and clWaitForEvents waits for the 1000th one. The device is not going to pause, because another 1000 invocations of work are already pre-batched in the queue. Then the same thing is repeated again and again. It can be illustrated this way:

enqueue 999 kernel commands
while(invocations < 8000000)
{
    enqueue 1 kernel command with an event   // marks the end of this batch
    enqueue 999 kernel commands              // the next batch, keeps the device busy
    wait for the event                       // bounds how far enqueueing runs ahead
    invocations += 1000
}

The optimal number of kernel invocations after which to wait may differ from the one shown here, so it needs to be worked out for the given scenario.
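For concreteness, the pattern above could be written out in host code roughly as follows. This is a sketch, not the exact code from the question: it uses a single kernel (the question alternates two kernels per sweep, but the batching is the same), assumes queue, kernel, and gws (the global work size) were created elsewhere, and omits error checking:

#include <CL/cl.h>

void run_batched(cl_command_queue queue, cl_kernel kernel, size_t gws)
{
    const size_t total = 8000000, batch = 1000;
    size_t invocations = 0;
    cl_event marker;

    /* Pre-batch: fill the queue so the device stays busy during the first wait. */
    for (size_t i = 0; i < batch - 1; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    invocations += batch - 1;

    while (invocations < total) {
        /* This kernel carries the event that closes the current batch. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &marker);

        /* Enqueue the next batch before blocking, so the queue never drains. */
        for (size_t i = 0; i < batch - 1; ++i)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
        invocations += batch;

        clWaitForEvents(1, &marker);  /* blocks until the marked kernel finishes */
        clReleaseEvent(marker);       /* release, or one event leaks per batch */
    }
    clFinish(queue);  /* drain the commands still queued after the last wait */
}

Releasing the event on every iteration matters: each cl_event holds driver resources until clReleaseEvent is called, so keeping them all alive would reintroduce the very memory growth this pattern is meant to avoid.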