GPU with OpenCL is slower than CPU. Why?

Question

Environment:

Intel i7-9750H
Intel UHD Graphics 630
Nvidia GTX1050 (Laptop)
Visual Studio 2019 / C++
OpenCV 4.4
OpenCL 3.0 (Intel) / 1.2 (Nvidia)

I'm trying to use OpenCL to speed up my code, but my measurements show that the CPU is faster than the GPU. How can I speed up my code?

void GetHoughLines(cv::Mat dst) {
    cv::ocl::setUseOpenCL(true);

    int img_w = dst.size().width; // 5000
    int img_h = dst.size().height; // 4000

    cv::UMat tmp_dst = dst.getUMat(cv::ACCESS_READ);
    cv::UMat tmp_mat = cv::UMat(dst.size(), CV_8UC1, cv::Scalar(0));

    for (size_t i = 0; i < 1000; i++)
    {
        tmp_mat = tmp_mat.mul(tmp_dst);
    }
}

It took about 3000 ms when I used only the CPU. With the Intel UHD Graphics 630 it took about 3500 ms, and with the GTX1050 it took about 3000 ms.

Please give me some ideas to speed it up. I need to get it down to 1000 ms or less. Should I use AMP or OpenMP? As far as I know, they can only handle simple operations and are not suitable for OpenCV functions.

Calling that a graphics card is really generous, that's why. Intel's embedded GPU is notoriously slow, as in it's the bare minimum you can put on a chip to get actual graphics on the screen, not much more. You'll get much better results on a discrete GPU. A mid-range AMD or NVidia card will likely perform significantly better. — tadman
@tadman Thank you for your reply. As you suggested, I tried the Nvidia GPU, but it is still slow. Do you have any other ideas to speed it up? — Soonmyun Jang
Get a faster GPU. Consider using a GPU-on-demand service, like a cloud-hosted option, if you just need to do a few quick runs. Look at any algorithmic improvements you can make. Use a profiler to find out if there are any optimizations you can make. See if you really need 1000 iterations. Etc. — tadman
I don't know exactly how OpenCV implements its OpenCL support, but I imagine your matrix has to be copied to and from the GPU on each loop iteration, which would be inefficient. Manually written OpenCL is likely to be faster. — Alan Birtles
Thank you for advice @tadman, AlanBirtles, pmdj I will find other ways. — Soonmyun Jang

Elad Maimoni · Accepted Answer · 2020-11-20T15:57:36

Basically, your code is slow because the way OpenCV uses OpenCL is inefficient. It has nothing to do with the underlying hardware.

For OpenCL code (or any GPU-related code, for that matter) to be efficient, it is crucial that the host-side code uses the GPU properly. To name a few principles:

Saturate the GPU by asynchronously enqueuing many computations (kernels).
Avoid unnecessary synchronizations.
Avoid unnecessary memory copies between host CPU and GPU device.

Even if you write the most optimized GPU kernels, you are very unlikely to see any performance gain if you fail to adhere to these basics.
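To get a feel for why the memory-copy principle matters, here is a back-of-the-envelope cost model in C++. It is only a sketch: the `CostModel` struct and both function names are made up for illustration, and the bus bandwidth and kernel time are assumed values, not measurements of the hardware in the question. It compares transferring three buffers (two inputs, one result) on every iteration with transferring them once.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cost model; every number fed into it is an assumption.
struct CostModel {
    double bytes_per_buffer;      // size of one image buffer in bytes
    double bus_bytes_per_second;  // assumed host <-> device bandwidth
    double kernel_seconds;        // assumed run time of one multiply kernel
};

// Time to move two inputs to the device and one result back to the host.
double copy_seconds(const CostModel& m) {
    assert(m.bus_bytes_per_second > 0.0);
    return 3.0 * m.bytes_per_buffer / m.bus_bytes_per_second;
}

// Worst case: all three buffers cross the bus on every iteration.
double seconds_copy_every_iteration(const CostModel& m, std::size_t iterations) {
    return static_cast<double>(iterations) * (copy_seconds(m) + m.kernel_seconds);
}

// Buffers cross the bus once in total; only kernels run inside the loop.
double seconds_device_resident(const CostModel& m, std::size_t iterations) {
    return copy_seconds(m) + static_cast<double>(iterations) * m.kernel_seconds;
}
```

With a 5000x4000 CV_8UC1 image (20 MB per buffer), an assumed 10 GB/s bus, an assumed 1 ms kernel and 1000 iterations, the model gives about 7 s when every iteration copies and about 1 s when the data stays on the device. An integrated GPU that shares memory with the CPU may behave differently, so treat the numbers as an illustration only.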

The OpenCV codebase is a great example of how not to adhere to these principles.

As for your example, if you rewrite your code to avoid memory copies and to use device memory explicitly, you may see reasonable performance:

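// Assumed here to match the question's input: 5000x4000, single-channel 8-bit.
const cv::Size size(5000, 4000);
const int format = CV_8UC1;

// Allocate all three buffers in device memory up front.
auto frame1 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
auto frame2 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
auto frame3 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);

for (size_t i = 0; i < 10; i++)
{
    // Inputs and output are all UMats, so no cv::Mat is touched inside the loop.
    cv::multiply(frame1, frame2, frame3);
}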

But in any case, I recommend learning to use the OpenCL API directly, without OpenCV.
