Why is downloading data from the GPU much slower than uploading it with OpenCL?

Question

I'm a beginner with OpenCL for image processing. I use Win7 + VS2010 + OpenCL 2.0 + OpenCV 2.4.7. My PC has an Intel i7 CPU and an NVIDIA GTX 760.

Here is my work:

I use OpenCV to read an image (1920*1080) from a video, then copy the image data and get the data pointer.
```
uchar* input_data=(uchar*)(gray_image->imageData);
```
Then I want to do some convolution and other image processing on the GPU, so I use OpenCL to upload this data (input_data) to device memory (cl_input_data), which was created beforehand. The upload step takes about 0.2 ms, which is fast.
```
clEnqueueWriteBuffer(queue, cl_input_data, CL_TRUE,
    0, ROI_size * sizeof(cl_uchar), (void*)input_data, 0, NULL, NULL);
```
The main processing runs in several kernels, each of which takes less than 0.1 ms, which all seems quite normal.
```
clEnqueueNDRangeKernel( queue,kernel_box,2,NULL,global_work_size,local_work_size, 0,NULL, NULL);
```
After all the processing, I want to download the GPU memory (cl_output_data) to the host (output_data), and this step takes over 5.5 ms, nearly 27 times slower than the upload step!
```
clEnqueueReadBuffer( queue,cl_output_data,CL_TRUE,0,ROI_size * sizeof(char),(void*) output_data,0, NULL, NULL );
```

So I'm wondering: since I used the same device and the data size was exactly the same, why are the upload and download times so different?

Oh, by the way, I measure time with something like QueryPerformanceFrequency(&m_Frequency).

Thank you!

The short answer is that GPUs are designed assuming that data goes from the rest of the computer, to the GPU, then to the display. While it's obviously possible to get data back from the GPU to the rest of the computer, it's not how it's really designed to work, so it's a lot slower. — Jerry Coffin
Thanks for answering! So you mean this isn't an abnormal situation, right? However, as I remember, the upload and download bandwidths are almost the same, so it still puzzles me... — David Ding
Offhand, I don't remember exact numbers, so I'm not sure it's exactly normal, but as a general idea: yes, it's normal that reading back from the GPU is quite a bit slower than writing to it. — Jerry Coffin

jet47 · Accepted Answer · 2013-12-30T08:09:25

As far as I remember, clEnqueueNDRangeKernel is an asynchronous call: it returns control to the host without waiting for the device. So when you time clEnqueueNDRangeKernel, you measure only the launch, not the processing. The blocking clEnqueueReadBuffer forces synchronization with the device and waits until all previously enqueued kernels have finished. Thus, your 5.5 ms includes the kernels' execution time.
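To make the arithmetic concrete, here is a minimal sketch in plain C that models an in-order queue with a simulated clock. It does not call OpenCL at all: SimQueue and the sim_* functions are hypothetical stand-ins for the real calls named in the comments, and any durations you feed it are made-up inputs, not measurements.

```c
#include <assert.h>

/* Toy in-order command queue with a simulated clock (milliseconds).
   Nothing here calls OpenCL; every name is a hypothetical stand-in. */
typedef struct {
    double now_ms;      /* simulated time as seen by the host */
    double pending_ms;  /* device work enqueued but not yet waited for */
} SimQueue;

/* Stands in for clEnqueueNDRangeKernel: records the work and returns
   immediately, so the host clock does not advance. */
static void sim_enqueue_kernel(SimQueue *q, double kernel_ms) {
    assert(kernel_ms >= 0.0);
    q->pending_ms += kernel_ms;
}

/* Stands in for clFinish: the host waits until all pending work is done. */
static void sim_finish(SimQueue *q) {
    q->now_ms += q->pending_ms;
    q->pending_ms = 0.0;
}

/* Stands in for a blocking clEnqueueReadBuffer: earlier commands must
   complete first, then the transfer itself takes place. */
static void sim_read_blocking(SimQueue *q, double transfer_ms) {
    assert(transfer_ms >= 0.0);
    sim_finish(q);
    q->now_ms += transfer_ms;
}

/* Host-side time measured around the blocking read.
   finish_first != 0 drains the queue before the timer starts. */
static double timed_read_ms(SimQueue *q, double transfer_ms, int finish_first) {
    if (finish_first)
        sim_finish(q);
    double start = q->now_ms;
    sim_read_blocking(q, transfer_ms);
    return q->now_ms - start;
}
```

For example, with four kernels of an assumed 1.25 ms each and an assumed 0.5 ms transfer, timed_read_ms reports 5.5 when finish_first is 0 and 0.5 when it is 1. In real code the equivalent step would be calling clFinish(queue) after the last kernel and before starting the timer, so that the read is timed on its own.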
