4 votes

I have the following OpenCL kernel function to get the column sum of an image.

__kernel void columnSum(__global float* src, __global float* dst, int srcCols,
                        int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols)
    {
        int srcIdx = x;
        int dstIdx = x;

        float sum = 0;

        // Walk down the column, writing the running (prefix) sum at each row.
        for (int y = 0; y < srcRows; ++y)
        {
            sum += src[srcIdx];
            dst[dstIdx] = sum;
            srcIdx += srcStep;
            dstIdx += dstStep;
        }
    }
}

I assign each thread to process one column, so that many threads can compute the column sums of all the columns in parallel.
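The host-side launch looks roughly like this (a sketch; the buffer and kernel setup are omitted, and the 256-wide work-group size is just an assumption):

// Launch one work-item per column; round the global size up to the group size.
size_t localSize  = 256;
size_t globalSize = ((srcCols + localSize - 1) / localSize) * localSize;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                       0, NULL, NULL);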

I also rewrote the above kernel using float4, so that each thread reads 4 consecutive elements of a row at a time from the source image, as shown below.

__kernel void columnSum(__global float* src, __global float* dst, int srcCols,
                        int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols / 4)
    {
        int srcIdx = x;
        int dstIdx = x;

        float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

        // Each thread accumulates 4 adjacent columns at once.
        for (int y = 0; y < srcRows; ++y)
        {
            float4 temp2 = vload4(0, &src[4 * srcIdx]);
            sum = sum + temp2;

            vstore4(sum, 0, &dst[4 * dstIdx]);

            srcIdx += (srcStep / 4);
            dstIdx += (dstStep / 4);
        }
    }
}

In this case, theoretically, I would expect the second kernel to take 1/4 of the time the first kernel takes to process an image. However, no matter how large the image is, the two kernels take almost the same time. I don't know why. Can you give me some ideas?

Then the bottleneck is not the information-sharing part. Is the performance acceptable apart from not getting the 1/4 speedup? I'm asking to see whether you may have a synchronization problem. How much computation do you perform? Is the simple summation just an example? - Csaba Toth
You know about SIMD instructions, right? They can do the same operation on 2, 4, 8, or 16 (or more, depending on the hardware) values simultaneously. That could be why the time is the same: it's effectively doing the same amount of work, but using SIMD instructions. This is very prevalent in GPUs, but can also be observed on CPUs to a lesser extent. 4 floats fill an SSE register, which sounds about right. That, and you could just be bottlenecked on the memory transfers as well, as Csaba notes. - Thomas
Well, the reason I use float4 is this: assume we have 256 threads that can run simultaneously and a 1024*1024 image. If we use the first kernel, the 256 threads first process the first 256 columns, then the next 256 columns, and so on, until all the columns are finished. In this case, 4*t is consumed, where t stands for the time the 256 threads need to process 256 columns. However, if we use the second kernel, since each thread processes 4 elements of a row, the 256 threads can process all 1024 columns in time t. This is just an estimation. @Thomas - user2326258
What I'm saying is that on hardware with sufficiently wide SIMD registers, adding/multiplying two float4s takes the same time as adding/multiplying two floats. So the fact that the two kernels take about the same time shouldn't be surprising in this sense. - Thomas
I don't know whether my estimation is correct. I actually used float4 in another kernel I wrote and it did give some speedup, so I tried float4 for the kernel in this post, but I cannot get any speedup. - user2326258

5 Answers

7 votes

OpenCL vector data types like float4 were a better fit for older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available to individual work-items; they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for the OpenCL driver on NVIDIA Kepler GPUs and on Intel HD integrated graphics. So adding two float4 vectors on a modern GPU requires 4 operations. On the other hand, the OpenCL driver on an Intel Core CPU reports CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so such vectors can be added in a single step.
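A quick way to check this on your own hardware is to query the device (a minimal C sketch; error handling is omitted and `device` is assumed to be an already-obtained cl_device_id):

#include <stdio.h>
#include <CL/cl.h>

/* Print the preferred native float vector width of a device. */
void print_preferred_float_width(cl_device_id device)
{
    cl_uint width = 0;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);
    printf("CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = %u\n", width);
}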

4 votes

You are reading the values directly from the "src" array (global memory), which is typically around 400 times slower than private memory. Your bottleneck is definitely the memory access, not the "add" operation itself.

When you move from float to float4, the vector operation (add/multiply/...) can be more efficient thanks to the GPU's ability to operate on vectors. However, the reads and writes to global memory remain the same, and since those are the main bottleneck, you will not see any speedup at all.

If you want to speed up your algorithm, you should move to local memory. However, you then have to handle the memory management and choose the proper block size manually.
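A minimal sketch of that staging pattern is below. The __local buffer is passed from the host via clSetKernelArg with a NULL pointer and a size of get_local_size(0) * TILE_ROWS floats; TILE_ROWS is an arbitrary choice here, and whether this actually helps this particular kernel is uncertain, since the scalar kernel's reads are already coalesced across work-items:

#define TILE_ROWS 16

__kernel void columnSumLocal(__global const float* src,
                             __global float* dst,
                             int srcCols, int srcRows,
                             int srcStep, int dstStep,
                             __local float* tile)
{
    const int x   = get_global_id(0);
    const int lx  = get_local_id(0);
    const int lsz = get_local_size(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    float sum = 0.0f;
    for (int y0 = 0; y0 < srcRows; y0 += TILE_ROWS)
    {
        const int rows = min(TILE_ROWS, srcRows - y0);

        // Stage a block of rows into local memory.
        if (x < srcCols)
            for (int r = 0; r < rows; ++r)
                tile[r * lsz + lx] = src[(y0 + r) * srcStep + x];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Accumulate the running column sums from the staged tile.
        if (x < srcCols)
            for (int r = 0; r < rows; ++r)
            {
                sum += tile[r * lsz + lx];
                dst[(y0 + r) * dstStep + x] = sum;
            }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}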

2 votes

Which architecture do you use?

Using float4 gives higher instruction-level parallelism (and thus requires 4 times fewer threads), so it should theoretically be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).

However, did I understand correctly that your kernel is doing a prefix sum (you store the partial sum at every iteration of y)? If so, the stores make memory writes the bottleneck.
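One way to test that hypothesis is to drop the per-row stores and write only one total per column (a diagnostic sketch, not a replacement, since it no longer produces the prefix sum):

__kernel void columnTotal(__global const float* src, __global float* dst,
                          int srcCols, int srcRows, int srcStep)
{
    const int x = get_global_id(0);

    // Convert the step from bytes to float elements.
    srcStep >>= 2;

    if (x < srcCols)
    {
        float sum = 0.0f;
        // Accumulate the whole column, storing only once at the end.
        for (int y = 0; y < srcRows; ++y)
            sum += src[y * srcStep + x];
        dst[x] = sum;
    }
}

If this runs much faster than the original, the per-row writes are indeed the bottleneck.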

0 votes

I think that on the GPU, float4 is not a SIMD operation in OpenCL. In other words, if you add two float4 values, the sum is done in four steps rather than all at once. floatn is really designed for the CPU; on the GPU it serves only as convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as a scalar processor without SIMD, but the threads in a warp are not independent the way they are on a CPU. The right way to think of the GPGPU model is Single Instruction, Multiple Threads (SIMT). http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

Have you tried running your code on the CPU? I think the float4 code should run quicker (potentially four times quicker) than the scalar code on the CPU. Also, if you have a CPU with AVX, you should try float8: if the float4 code is faster on the CPU, then float8 should be even faster on a CPU with AVX.
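A float8 variant of the kernel would look roughly like this (an untested sketch that assumes srcCols is a multiple of 8, following the indexing of the float4 version in the question):

__kernel void columnSum8(__global float* src, __global float* dst,
                         int srcCols, int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols / 8)
    {
        float8 sum = (float8)(0.0f);

        // Each thread accumulates 8 adjacent columns at once.
        for (int y = 0; y < srcRows; ++y)
        {
            sum += vload8(0, &src[y * srcStep + 8 * x]);
            vstore8(sum, 0, &dst[y * dstStep + 8 * x]);
        }
    }
}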

0 votes

Try defining an __attribute__ on the kernel and see how the run time changes. For example, try:

__kernel void __attribute__((vec_type_hint(int)))

or

__kernel void __attribute__((vec_type_hint(int4)))

or some floatn, as you want.
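Applied to the float4 version of the kernel in the question, the hint would look roughly like this (using the attribute placement shown in the spec; whether the compiler exploits the hint is implementation-defined):

__kernel __attribute__((vec_type_hint(float4)))
void columnSum(__global float* src, __global float* dst,
               int srcCols, int srcRows, int srcStep, int dstStep)
{
    // ... body as in the question ...
}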

Read more: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html