I am executing Monte Carlo sweeps on a population of replicas of my system using OpenCL kernels. After the initial debugging phase I increased some of the arguments to more realistic values and noticed that the program suddenly eats up large amounts of host memory. I am executing 1000 sweeps on about 4000 replicas, and each sweep consists of 2 kernel invocations per replica. That results in about 8 million kernel invocations.
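In outline, the enqueue structure looks something like this (a simplified sketch, not my actual code; `queue`, `kernel_a`, `kernel_b`, the work sizes and the missing error handling are placeholders):

```c
/* Simplified sketch of the enqueue pattern (placeholder names, no error checks). */
size_t global = 256, local = 64;              /* per-replica work sizes (placeholders) */
for (int sweep = 0; sweep < 1000; ++sweep) {
    for (cl_int r = 0; r < 4000; ++r) {
        clSetKernelArg(kernel_a, 0, sizeof(cl_int), &r);
        clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global, &local, 0, NULL, NULL);
        clSetKernelArg(kernel_b, 0, sizeof(cl_int), &r);
        clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, &global, &local, 0, NULL, NULL);
    }
}
clFinish(queue);  /* block once, at the very end */
```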
The source of the memory usage was easy to find (see screenshot).
- While the kernel executions are being enqueued, the memory usage goes up.
- While the kernels are executing, the memory usage stays constant.
- Once the kernels finish, the memory usage drops back to its original level.
- I did not allocate any memory, as can be seen in the memory snapshots.
That means the OpenCL driver is using the memory. I understand that it must keep a copy of all the arguments to the kernel invocations, and also the global and local work sizes, but that does not add up.
The peak memory usage was 4.5GB. Before enqueuing the kernels, about 250MB were used. That means OpenCL used about 4.25GB for 8 million invocations, i.e. roughly 530 bytes per invocation.
So my questions are:
- Is that kind of memory usage normal and to be expected?
- Are there good/known techniques to reduce memory usage?
- Maybe I should not enqueue so many kernels simultaneously, but how would I do that without causing synchronization, e.g. with `clFinish()`?
Comments:

`clFlush` every 100 or 1000 kernel invocations, or even more rarely if you observe it hurts the performance. It is a non-blocking command which will issue all previously queued commands to the device. – doqtor
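A minimal sketch of that suggestion (same placeholder names as in the sketch above; `FLUSH_EVERY` is a tunable assumption, not a recommended value):

```c
/* Sketch: periodically clFlush so the driver starts submitting work
 * instead of buffering millions of queued commands. clFlush is non-blocking. */
const long FLUSH_EVERY = 1000;  /* tunable; 100..1000 suggested above */
long enqueued = 0;
for (int sweep = 0; sweep < 1000; ++sweep) {
    for (cl_int r = 0; r < 4000; ++r) {
        clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global, &local, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, &global, &local, 0, NULL, NULL);
        enqueued += 2;
        if (enqueued % FLUSH_EVERY == 0)
            clFlush(queue);   /* issue queued commands to the device, don't wait */
    }
}
clFinish(queue);  /* single blocking wait at the end */
```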
I tried a `clFlush` after every 1000 kernel invocations and that already improved it a bit. Peak memory usage went down to 3.5GB and the peak is much shorter, i.e. the downward slope begins much earlier but is not as steep. The impact on performance is either small or non-existent; I can't see any. – Gigo
You could also use `clWaitForEvents`. Enqueue 1000 kernels, adding an event on the last one, then enqueue another 999 kernels and call `clWaitForEvents`. This will make it wait for the 1000th kernel to finish whilst there will be another 999 kernels already in the queue. Then repeat the whole thing similarly. – doqtor
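A minimal sketch of that batching scheme, assuming a single queue (`kernel`, `global`, `local` and `TOTAL` are placeholders):

```c
/* Sketch: attach an event to the last kernel of each batch of 1000 and,
 * once the next batch is already enqueued, wait on the previous event.
 * This bounds the queue length without fully draining it like clFinish would. */
const long BATCH = 1000, TOTAL = 8000000;   /* placeholder totals */
cl_event batch_done = NULL, prev_done = NULL;
for (long i = 0; i < TOTAL; ++i) {
    cl_event *evt = ((i + 1) % BATCH == 0) ? &batch_done : NULL;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, evt);
    if (evt != NULL) {
        if (prev_done != NULL) {
            /* previous batch finishes while this batch is already queued */
            clWaitForEvents(1, &prev_done);
            clReleaseEvent(prev_done);
        }
        prev_done = batch_done;
        batch_done = NULL;
    }
}
if (prev_done != NULL) {
    clWaitForEvents(1, &prev_done);
    clReleaseEvent(prev_done);
}
clFinish(queue);  /* wait for the final partial batch */
```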
Would a `clFlush` before the `clWaitForEvents` make any sense? I did not notice any difference. – Gigo
A `clFlush` just before `clWaitForEvents`? Rather not, because I think it's done by `clWaitForEvents` anyway. If you want to get the memory usage down by more than 750MB, I'd try decreasing the number of kernels after which you wait for them to finish, for example to 500. This number needs to be worked out so that it is optimal for your scenario. I'm going to put what we did here as an answer; accept it if you are happy. – doqtor