How do OpenCL memory transfer functions work?

Question

I have a couple of questions about OpenCL memory transfer functions. I have come across many questions on this topic, but none of them received a thorough answer. Perhaps we can collect a complete answer here.

This is my current understanding of the three ways of moving data:

1) enqueueReadBuffer/enqueueWriteBuffer - these two functions always copy the contents of a buffer created on the host to the device, or from the device back to the host. No pinned memory and no DMA mechanism are used here.

2) enqueueMigrateMemObjects - this is sometimes described as an alternative to enqueueRead/Write, but in this case, memory is copied exactly at the time of this function call. No pinned memory and no DMA mechanism are used here.

3) enqueueMapBuffer/enqueueUnmapMemObject - here pinned memory and the DMA mechanism are always used.

This function works with two types of buffers: those created with the CL_MEM_USE_HOST_PTR flag and those created with the CL_MEM_ALLOC_HOST_PTR flag. With the first, an array created on the host is mapped to an array created on the device. With the second, the array is allocated on the device and mapped to a newly created array on the host.

This is what I can state according to the documentation. I ran several tests, but only saw that the migration function is faster than reading/writing. Regarding these three points, I have some questions:

1) If these functions only copy, then why do people here https://software.intel.com/en-us/forums/opencl/topic/509406 talk about pinning/unpinning memory during reading/writing? Under which conditions do these functions use pinned memory? Or is this just a feature of the Intel implementation, where ALL memory-transfer functions use pinned memory and DMA?

Also, does it mean that if I use pinned memory, the DMA mechanism will work? And vice versa: if I want DMA to work, do I need pinned memory?

2) Is this migration function exactly what happens inside the enqueueReadBuffer/enqueueWriteBuffer functions, minus some additional overhead that those functions add? Does it always just copy, or does it also do a DMA transfer?

For some reason, some sources, when talking about DMA transfer, use the words "copy", "memory", or "migration" for transferring data between two buffers (on the host and on the device). However, there cannot be any copy; we just write directly to the buffer without any copy at all. How should I treat this write during DMA?

What will happen if I use enqueueMigrateMemObjects with buffers created with the CL_MEM_USE_HOST_PTR flag?

3) With these two functions, there is total confusion. How will the mapping and reading/writing happen if I use a) an existing host pointer or b) a newly allocated host pointer?

Also, here I do not properly understand how DMA works. If I have mapped my buffer on the host side to the buffer on the device side, which functions transfer the memory between them? Should I always unmap my buffer afterwards?

There is no explanation anywhere for this, along the lines of: "When we create a new buffer with this flag and use this memory transfer function, the data is transferred this way and such features as ... are used. If the memory is created as read-only, this happens; if the memory is write-only, that happens".

Maybe there is already a good guide for this, but from the OpenCL specification, I cannot answer my questions.

huseyin tugrul buyukisik · Accepted Answer · 2019-09-09T15:31:17

1) DMA is used in all data transfer commands, but it works only on pinned memory regions. Otherwise the OS could page the memory out mid-transfer and the device would read wrong data.

In enqueueReadBuffer/enqueueWriteBuffer, data first goes to an internal buffer which is already pinned (or pinned just in time, I don't know), then it goes to GPU memory using DMA. This is because the GPU sends and receives data in page-like units, with sizes that are integer multiples of 4096 and starting addresses that are multiples of 4096 (depending on the vendor's alignment rules). ReadBuffer and WriteBuffer can be slower since they do two copies: one from the host array to the internal array, then one from the internal pinned array to the GPU buffer.

In enqueueMapBuffer/enqueueUnmapMemObject, it can do DMA transfers directly and efficiently, because it temporarily pins the region only once and then flushes only the pages that were touched from the host side (you write to the mapped host pointer, and since it is mapped, the data is uploaded to the GPU buffer).

Using CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR can only be an optimization to skip the internal (pinned) array copying step, so make sure your buffers fit the requirements. You are not guaranteed to always get a pinned array; pinned memory is a rather scarce resource. Pinned means that the OS will never page it out. You may also just "touch" the first byte of each page (4 kB?) of a buffer to fake "pinning" (before calling data transfer functions), but this is not legal and may not always work. I did, however, observe speedups in my OpenCL applications just by giving a good offset and a good size to copy (like 4k aligned and 64k sized) on USE_HOST_PTR flagged buffers (tried only on AMD and NVIDIA GPUs and an Intel iGPU; I can't say anything about a Xeon Phi device).
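As a rough illustration of the "good offset and good size" idea, here is a minimal sketch in plain C++ (no OpenCL calls) that allocates a page-aligned, chunk-padded host array which could then be handed to a CL_MEM_USE_HOST_PTR buffer. The 4096 and 64 kB constants are just the example figures from above, and the helper names (is_aligned, round_up, alloc_transfer_friendly) are made up for this sketch; the real alignment rules depend on the vendor. It assumes a C++17 toolchain that provides std::aligned_alloc (not available on MSVC).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumed example values ("4k aligned and 64k sized"); real requirements
// depend on the vendor and device.
constexpr std::size_t kPageAlign = 4096;
constexpr std::size_t kChunkSize = 64 * 1024;

// True if `p` starts on an `alignment`-byte boundary (alignment: power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Rounds `n` up to the next multiple of `multiple` (multiple: power of two).
inline std::size_t round_up(std::size_t n, std::size_t multiple) {
    assert(multiple != 0 && (multiple & (multiple - 1)) == 0);
    return (n + multiple - 1) & ~(multiple - 1);
}

// Allocates at least `bytes` bytes, page-aligned and padded to a whole number
// of 64 kB chunks. Writes the padded size to `padded_out` if it is non-null.
// Release the result with std::free().
inline void* alloc_transfer_friendly(std::size_t bytes, std::size_t* padded_out) {
    const std::size_t padded = round_up(bytes == 0 ? 1 : bytes, kChunkSize);
    if (padded_out) *padded_out = padded;
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // which holds here because 64 kB is a multiple of 4096.
    return std::aligned_alloc(kPageAlign, padded);
}
```

An existing pointer can be checked with is_aligned(ptr, kPageAlign) before it is passed to the buffer constructor. Whether the implementation then actually pins the memory or skips the extra copy is still up to the vendor.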

You need pinned memory only to skip extra copy overheads. You need map/unmap to optimize pinning.

2) The migration function migrates a GPU buffer's ownership to another GPU. If you use it on the same GPU, it shouldn't do anything useful besides copying the buffer to RAM and then copying it back again. If you use CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED, it doesn't copy the data; it just moves the ownership to the other device. Then you can copy the data yourself (if the data you want is on the host, not on the source device).

Some implementations copy the data directly over PCIe (I don't know if this uses GPU1 DMA to GPU2 DMA, but I guess yes), and some implementations go through RAM using two DMA operations (but that can be optimized by pipelining to some extent?).

"What will happen if I use enqueueMigrateMemObjects with buffers created with the CL_MEM_USE_HOST_PTR flag?"

I didn't try, but I guess that it will move only the ownership, since the data is only on the host. Also, the Intel link you gave includes someone saying that "migration triggers DMA operation". Maybe two Intel Xeon Phis, by default, communicate via DMA rather than going through system RAM when migrating between them.

3) CL_MEM_USE_HOST_PTR is meant to work with your application's pointers. They can be anything, even a pinned one (but pinned outside of OpenCL's internal rules, which may not always be good). CL_MEM_ALLOC_HOST_PTR is meant to use the OpenCL implementation's pinned memory, if it can. If you use only CL_MEM_READ_WRITE, the buffer is in device memory. Working with CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR means the OpenCL kernel will do zero-copy access if the device shares RAM directly with the CPU (an iGPU, for example). Without pinning, there is an extra copy; with pinning, the iGPU does no copy. That is a very big difference. But for a discrete GPU (or a compute card like a Xeon Phi), the pinned version could be somewhere between 1.0x and 2.0x the speed of the extra-copy version (considering host-to-host and host-to-device copies have similar bandwidths).

Mapping means host-side mapping: the host sees device memory "mapped" into its own, so the device can't access it (from a kernel, for example) while it is mapped. Writing means host-side writing: the host writes to GPU memory. Reading means host-side reading: the host reads from GPU memory. So, map/unmap happens like this:
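One way to read the 1.0x to 2.0x estimate for a discrete GPU is a back-of-the-envelope model. The symbols are introduced here only for illustration: $S$ is the transfer size, $B_{hh}$ the host-to-host copy bandwidth, and $B_{hd}$ the host-to-device bandwidth. Pinning cost and latency are ignored, and the two copies of the staged path are assumed to run one after the other:

```latex
T_{\text{staged}} = \frac{S}{B_{hh}} + \frac{S}{B_{hd}}, \qquad
T_{\text{direct}} = \frac{S}{B_{hd}}, \qquad
\frac{T_{\text{staged}}}{T_{\text{direct}}} = 1 + \frac{B_{hd}}{B_{hh}}
```

With $B_{hh} = B_{hd}$ the ratio is 2.0. If the implementation overlaps the two copies (pipelining), the staged time approaches $\max(S/B_{hh},\, S/B_{hd})$ and the ratio falls toward 1.0, which would give the range quoted above.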

CL_MAP_WRITE_INVALIDATE_REGION version:

map the device buffer (the map command returns a host pointer)
- now buffer ownership is on the host
use the GPU buffer as if it were this host pointer
unmap the region so it flushes the latest bits to GPU memory (or a no-op if it is an iGPU!)
- now buffer ownership is on the device

There is also another (CL_MAP_WRITE) usage:

prepare the data beforehand
map (with initial copy enabled, so it gets all the data) // extra overhead
optionally "update" some elements
unmap
- now the data (with any updates) is in GPU memory, so that an OpenCL kernel function can use it as a parameter

Between map and unmap, it can flush any host input you feed there into GPU memory, because it pins the whole region at once. Otherwise (as in enqueueWriteBuffer) it would need to pin and unpin whatever data is currently being sent to the GPU (even a single integer, together with its whole page), and that would be slow.

When you copy a 1 GB buffer, memory consumption doesn't go up to 2 GB. It handles the pin/unpin/extra-copy operations on a smaller internal array, something like 64 kB, but that depends on the vendor's implementation of OpenCL.

But enqueueWriteBuffer uses an internal buffer to do the necessary copying and pinning. The pinning, unpinning, and extra copying all make enqueueWriteBuffer slower, but you can still try giving it a properly aligned host pointer and a properly sized region to make the copy faster. At least this would let it do the pinning only once for the whole array in the background. It may even skip the extra copy if the implementation has that optimization.
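The small-internal-array idea can be sketched in plain C++ as a simulation only: this is not how any particular driver is implemented. The function name staged_copy is hypothetical, the 64 kB default is just the figure guessed above, and std::memcpy stands in for both the host-side copy and the DMA step. The point is that the extra memory stays at the staging size no matter how large the transfer is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `bytes` bytes from `src` to `dst` through a fixed-size staging
// buffer, the way a driver might route a transfer through a small pinned
// array. Returns the number of chunks moved.
inline std::size_t staged_copy(const unsigned char* src, unsigned char* dst,
                               std::size_t bytes,
                               std::size_t staging_bytes = 64 * 1024) {
    assert(staging_bytes > 0);
    // Stand-in for the internal pinned buffer; its size does not grow with `bytes`.
    std::vector<unsigned char> staging(staging_bytes);
    std::size_t chunks = 0;
    for (std::size_t off = 0; off < bytes; off += staging_bytes) {
        const std::size_t n = std::min(staging_bytes, bytes - off);
        std::memcpy(staging.data(), src + off, n);  // copy 1: host array -> staging
        std::memcpy(dst + off, staging.data(), n);  // copy 2: staging -> "device" (DMA stand-in)
        ++chunks;
    }
    return chunks;
}
```

Every byte is copied twice here, which is the extra-copy overhead described for enqueueWriteBuffer, while peak extra memory is a single 64 kB array.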
