Intel HD GPU vs Intel CPU performance comparison

Question

I am a newbie in OpenCL and currently have some questions about its performance.

I have an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz with Ubuntu and Beignet (Intel's open-source OpenCL library; see http://arrayfire.com/opencl-on-intel-hd-iris-graphics-on-linux/ and http://www.freedesktop.org/wiki/Software/Beignet/).

I have a simple benchmark:

#define __CL_ENABLE_EXCEPTIONS
#include "CL/cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

using namespace cl;
using namespace std;

void CPUadd(vector<float> & A, vector<float> & B, vector<float> & C)
{
    for (int i = 0; i < A.size(); i++)
    {
        C[i] = A[i] + B[i];
    }
}

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_GPU);
    static const unsigned elements = 1000000;
    vector<float> data(elements, 6);
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

    Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx] + a[idx] * b[idx] + 5;
        }
    )d", true);

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");

#if 1
    for (int i = 0; i < 4000; i++)
    {
        add(EnqueueArgs(elements), a, b, c);
    }
    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));
#else
    vector<float> result(elements);
    for (int i = 0; i < 4000; i++)
    {
        CPUadd(data, data, result);
    }
#endif

    //std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}

According to my measurements, the Intel HD GPU is 20x faster than a single CPU core (see the benchmark above). That seems too small to me, because with all 4 cores the GPU would be only 5x faster. Did I write the benchmark correctly, and does the speed-up seem realistic? Unfortunately, clinfo on my machine does not list the CPU as an OpenCL device, so I can't make a direct comparison.

UPDATE

Measurements

$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m37.316s
user 0m37.280s
sys 0m0.016s

$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m2.349s
user 0m0.524s
sys 0m0.624s

Total:
2.349 - 0.524 = 1.825 for GPU
37.316 - 0.524 = 36.792 for CPU

36.792 / 1.825 = 20.16x faster than a single CPU core => about 5x faster than the full 4-core CPU.

What are your expectations based on? As a very rough guideline, you can compare peak throughput. — void_ptr
The HD GPU's preferred float vector width is probably 8, while a single CPU core's preferred width is 4. You are using scalar OpenCL code, which may favor the CPU. Make it use float8, then ask again. — huseyin tugrul buyukisik
Measuring the fastest CL device by doing vector sums of 10k elements is like finding the faster runner over a 1 m distance. There will be a memory bottleneck, I/O overhead, launch overhead, and so on, which makes all the measurements invalid. Also, as others pointed out, even your kernels are not equivalent. — DarkZeros
@huseyintugrulbuyukisik See above: Context(CL_DEVICE_TYPE_GPU); — Marat Zakirov

simpel01 · Accepted Answer · 2015-12-02T19:07:51

The two implementations you are comparing are not functionally equivalent.

Your CPU implementation needs about 30% less memory bandwidth, which may explain the performance. It accesses only arrays A and B, while the GPU kernel uses three arrays: a, b and c.
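
As a rough sketch of what a like-for-like CPU baseline could look like, the loop below evaluates the same expression as the kernel and writes to a separate output array. It also times only the loop with std::chrono instead of timing the whole process. The names cpu_add_equivalent and time_cpu_add are hypothetical and do not come from the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical CPU counterpart of the OpenCL kernel "add":
// same arithmetic, two input arrays and one output array.
void cpu_add_equivalent(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::vector<float>& c)
{
    // All three arrays must have the same length.
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i] + a[i] * b[i] + 5.0f;
    }
}

// Runs the loop `iterations` times and returns the elapsed
// wall-clock time in seconds for that region only.
double time_cpu_add(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c,
                    int iterations)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        cpu_add_equivalent(a, b, c);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_cpu_add(a, b, c, 4000) with three separate 1000000-element vectors would mirror the GPU run. Note also that the g++ command in the question passes no optimization flag, so building with something like -O2 may change the CPU number considerably.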
