I am a newbie in OpenCL and currently have some questions about its performance.

I have an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Ubuntu, and Beignet (Intel's open-source OpenCL library; see http://arrayfire.com/opencl-on-intel-hd-iris-graphics-on-linux/ and http://www.freedesktop.org/wiki/Software/Beignet/).

I have a simple benchmark:

#define __CL_ENABLE_EXCEPTIONS
#include "CL/cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>

using namespace cl;
using namespace std;

void CPUadd(vector<float> & A, vector<float> & B, vector<float> & C)
{
    for (int i = 0; i < A.size(); i++)
    {
        C[i] = A[i] + B[i];
    }
}

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_GPU);
    static const unsigned elements = 1000000;
    vector<float> data(elements, 6);
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));

    Program addProg(R"d(
        kernel
        void add(   global const float * restrict const a,
                    global const float * restrict const b,
                    global       float * restrict const c) {
            unsigned idx = get_global_id(0);
            c[idx] = a[idx] + b[idx] + a[idx] * b[idx] + 5;
        }
    )d", true);

    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");

#if 1
    for (int i = 0; i < 4000; i++)
    {
        add(EnqueueArgs(elements), a, b, c);
    }
    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));
#else
    vector<float> result(elements);
    for (int i = 0; i < 4000; i++)
    {
        CPUadd(data, data, result);
    }
#endif

    //std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}

According to my measurements, the Intel HD GPU is 20x faster than a single CPU core (see the benchmark above). That seems too small to me, because with all 4 cores in use I would get only a 5x speed-up from the GPU. Did I write the benchmark correctly, and does the speed-up seem realistic? Unfortunately, clinfo does not find the CPU as an OpenCL device in my case, so I can't do a direct comparison.

UPDATE

Measurements

$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m37.316s
user 0m37.280s
sys 0m0.016s

$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m2.349s
user 0m0.524s
sys 0m0.624s

Total: 2.349 - 0.524 = 1.825 for GPU; 37.316 - 0.524 = 36.792 for CPU.

36.792 / 1.825 = 20.16x faster than a single CPU core => about 5x faster than the full 4-core CPU.
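
The figures above come from whole-process `time` output, which also includes program start-up, OpenCL initialisation and kernel compilation. A minimal sketch of timing only the region of interest with `std::chrono` might look like the following. The names `cpu_add` and `time_seconds` are hypothetical helpers, not part of the original benchmark. For the GPU path the timed region would need to end with a blocking call (the `cl::copy` already in the benchmark should serve), because kernel enqueues return before the work has finished. Note also that the `g++` command line shown has no `-O` flag, so the CPU loop is built without optimisation, which may inflate the CPU time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same element-wise sum as CPUadd in the question, with an index type
// that matches vector::size().
void cpu_add(const std::vector<float>& a,
             const std::vector<float>& b,
             std::vector<float>& c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: calls `work` `repeats` times and returns the
// elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double time_seconds(int repeats, F&& work)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For the CPU branch this could be called as `time_seconds(4000, [&] { cpu_add(data, data, result); })`, and the GPU branch could wrap its enqueue loop plus the final `cl::copy` in a single call with `repeats` set to 1.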

What are your expectations based on? As a very rough guideline, you can compare peak throughput. - void_ptr
The HD GPU's preferred float vector width is probably 8, while a single CPU core's preferred width is 4. You are using scalar OpenCL code, which may favor the CPU. Make it use float8, then ask again. - huseyin tugrul buyukisik
Measuring the fastest CL device by doing vector sums of 10k elements is like finding the faster runner over a 1 m distance. Memory bottlenecks, I/O overhead, launch overhead and so on make all such measurements invalid. Also, as others pointed out, your kernels are not even equivalent. - DarkZeros
@huseyintugrulbuyukisik See above: Context(CL_DEVICE_TYPE_GPU); - Marat Zakirov

1 Answer

The two implementations you are comparing are not functionally equivalent.

Your CPU implementation needs roughly 30% less memory bandwidth, which may explain the performance. It is called as CPUadd(data, data, result), so it touches only two distinct arrays (data and result), while the GPU kernel uses three: it reads a and b and writes c.
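
One way to make the comparison fairer is a CPU loop that mirrors the kernel more closely. The sketch below is illustrative only: `cpu_add_like_kernel` is a hypothetical name, and it assumes the caller passes three separate vectors so that the memory traffic resembles the three-buffer kernel. It also uses the kernel's arithmetic (`a + b + a * b + 5`) instead of the plain sum in `CPUadd`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU counterpart of the OpenCL `add` kernel from the question.
// Same per-element arithmetic; meant to be called with three distinct
// arrays so it reads two inputs and writes one output, like the kernel.
void cpu_add_like_kernel(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i] + a[i] * b[i] + 5;
}
```

In the benchmark this would be called with two separately allocated input vectors (for example, two copies of `data`) instead of passing `data` twice.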