4 votes

I have the following OpenCL kernel function to get the column sum of an image.

__kernel void columnSum(__global float* src, __global float* dst, int srcCols,
                        int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols)
    {
        int srcIdx = x;
        int dstIdx = x;

        float sum = 0;

        // Walk down the column, writing the running (prefix) sum at each row.
        for (int y = 0; y < srcRows; ++y)
        {
            sum += src[srcIdx];
            dst[dstIdx] = sum;
            srcIdx += srcStep;
            dstIdx += dstStep;
        }
    }
}

I assign each thread to process one column, so that many threads can compute the column sums of all the columns in parallel.
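The host-side launch looks roughly like this (a sketch; the buffer and kernel setup are omitted, and the 256-wide work-group size is just an assumption):

// Launch one work-item per column; round the global size up to the group size.
size_t localSize  = 256;
size_t globalSize = ((srcCols + localSize - 1) / localSize) * localSize;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                       0, NULL, NULL);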

I also rewrote the above kernel using float4, so that each thread reads 4 consecutive elements of a row at a time from the source image, as shown below.

__kernel void columnSum(__global float* src, __global float* dst, int srcCols,
                        int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols / 4)
    {
        int srcIdx = x;
        int dstIdx = x;

        float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

        // Each thread accumulates 4 adjacent columns at once.
        for (int y = 0; y < srcRows; ++y)
        {
            float4 temp2 = vload4(0, &src[4 * srcIdx]);
            sum = sum + temp2;

            vstore4(sum, 0, &dst[4 * dstIdx]);

            srcIdx += (srcStep / 4);
            dstIdx += (dstStep / 4);
        }
    }
}

In this case, theoretically, I would expect the second kernel to take 1/4 of the time the first kernel takes to process an image. However, no matter how large the image is, the two kernels take almost the same time. I don't know why. Can you give me some ideas?

Then the bottleneck is not the information-sharing part. Is the performance acceptable apart from not getting the 1/4 speedup? I'm asking to see whether you may have a synchronization problem. How much computation do you perform? Is the simple summation just an example? - Csaba Toth
You know about SIMD instructions, right? They can do the same operation on 2, 4, 8, or 16 (or more, depending on the hardware) values simultaneously. That could be why the time is the same: it's effectively doing the same amount of work, but using SIMD instructions. This is very prevalent in GPUs, but can also be observed on CPUs to a lesser extent. 4 floats fill an SSE register, which sounds about right. That, and you could just be bottlenecked on the memory transfers as well, as Csaba notes. - Thomas
Well, the reason I use float4 is this: assume we have 256 threads that can run simultaneously and a 1024*1024 image. If we use the first kernel, the 256 threads first process the first 256 columns, then the next 256 columns, and so on, until all the columns are finished. In this case, 4*t is consumed, where t stands for the time the 256 threads need to process 256 columns. However, if we use the second kernel, since each thread processes 4 elements of a row, the 256 threads can process all 1024 columns in time t. This is just an estimation. @Thomas - user2326258
What I'm saying is that on hardware with sufficiently wide SIMD registers, adding/multiplying two float4s takes the same time as adding/multiplying two floats. So the fact that the two kernels take about the same time shouldn't be surprising in this sense. - Thomas
I don't know whether my estimation is correct. I actually used float4 in another kernel I wrote and it did give some speedup, so I tried float4 for the kernel in this post, but I cannot get any speedup. - user2326258

5 Answers

7 votes

OpenCL vector data types like float4 were a better fit for older GPU architectures, especially AMD's GPUs. Modern GPUs don't have SIMD registers available to individual work-items; they are scalar in that respect. CL_DEVICE_PREFERRED_VECTOR_WIDTH_* equals 1 for the OpenCL driver on NVIDIA Kepler GPUs and on Intel HD integrated graphics. So adding two float4 vectors on a modern GPU requires 4 operations. On the other hand, the OpenCL driver on an Intel Core CPU reports CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT equal to 4, so such vectors can be added in a single step.
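A quick way to check this on your own hardware is to query the device (a minimal C sketch; error handling is omitted and `device` is assumed to be an already-obtained cl_device_id):

#include <stdio.h>
#include <CL/cl.h>

/* Print the preferred native float vector width of a device. */
void print_preferred_float_width(cl_device_id device)
{
    cl_uint width = 0;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);
    printf("CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = %u\n", width);
}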

4 votes

You are reading the values directly from the "src" array (global memory), which is typically around 400 times slower than private memory. Your bottleneck is definitely the memory access, not the "add" operation itself.

When you move from float to float4, the vector operation (add/multiply/...) can be more efficient thanks to the GPU's ability to operate on vectors. However, the reads and writes to global memory remain the same, and since those are the main bottleneck, you will not see any speedup at all.

If you want to speed up your algorithm, you should move to local memory. However, you then have to handle the memory management and choose the proper block size manually.
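A minimal sketch of that staging pattern is below. The __local buffer is passed from the host via clSetKernelArg with a NULL pointer and a size of get_local_size(0) * TILE_ROWS floats; TILE_ROWS is an arbitrary choice here, and whether this actually helps this particular kernel is uncertain, since the scalar kernel's reads are already coalesced across work-items:

#define TILE_ROWS 16

__kernel void columnSumLocal(__global const float* src,
                             __global float* dst,
                             int srcCols, int srcRows,
                             int srcStep, int dstStep,
                             __local float* tile)
{
    const int x   = get_global_id(0);
    const int lx  = get_local_id(0);
    const int lsz = get_local_size(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    float sum = 0.0f;
    for (int y0 = 0; y0 < srcRows; y0 += TILE_ROWS)
    {
        const int rows = min(TILE_ROWS, srcRows - y0);

        // Stage a block of rows into local memory.
        if (x < srcCols)
            for (int r = 0; r < rows; ++r)
                tile[r * lsz + lx] = src[(y0 + r) * srcStep + x];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Accumulate the running column sums from the staged tile.
        if (x < srcCols)
            for (int r = 0; r < rows; ++r)
            {
                sum += tile[r * lsz + lx];
                dst[(y0 + r) * dstStep + x] = sum;
            }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}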

2 votes

Which architecture do you use?

Using float4 gives higher instruction-level parallelism (and thus requires 4 times fewer threads), so it should theoretically be faster (see http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf).

However, did I understand correctly that your kernel is doing a prefix sum (you store the partial sum at every iteration of y)? If so, the stores make memory writes the bottleneck.
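One way to test that hypothesis is to drop the per-row stores and write only one total per column (a diagnostic sketch, not a replacement, since it no longer produces the prefix sum):

__kernel void columnTotal(__global const float* src, __global float* dst,
                          int srcCols, int srcRows, int srcStep)
{
    const int x = get_global_id(0);

    // Convert the step from bytes to float elements.
    srcStep >>= 2;

    if (x < srcCols)
    {
        float sum = 0.0f;
        // Accumulate the whole column, storing only once at the end.
        for (int y = 0; y < srcRows; ++y)
            sum += src[y * srcStep + x];
        dst[x] = sum;
    }
}

If this runs much faster than the original, the per-row writes are indeed the bottleneck.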

0 votes

I think that on the GPU, float4 is not a SIMD operation in OpenCL. In other words, if you add two float4 values, the sum is done in four steps rather than all at once. floatn is really designed for the CPU; on the GPU it serves only as convenient syntax, at least on Nvidia cards. Each thread on the GPU acts as a scalar processor without SIMD, but the threads in a warp are not independent the way they are on a CPU. The right way to think of the GPGPU model is Single Instruction, Multiple Threads (SIMT). http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

Have you tried running your code on the CPU? I think the float4 code should run quicker (potentially four times quicker) than the scalar code on the CPU. Also, if you have a CPU with AVX, you should try float8: if the float4 code is faster on the CPU, then float8 should be even faster on a CPU with AVX.
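A float8 variant of the kernel would look roughly like this (an untested sketch that assumes srcCols is a multiple of 8, following the indexing of the float4 version in the question):

__kernel void columnSum8(__global float* src, __global float* dst,
                         int srcCols, int srcRows, int srcStep, int dstStep)
{
    const int x = get_global_id(0);

    // Convert the steps from bytes to float elements.
    srcStep >>= 2;
    dstStep >>= 2;

    if (x < srcCols / 8)
    {
        float8 sum = (float8)(0.0f);

        // Each thread accumulates 8 adjacent columns at once.
        for (int y = 0; y < srcRows; ++y)
        {
            sum += vload8(0, &src[y * srcStep + 8 * x]);
            vstore8(sum, 0, &dst[y * dstStep + 8 * x]);
        }
    }
}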

0 votes

Try defining an __attribute__ on the kernel and see how the run time changes. For example, try:

__kernel void __attribute__((vec_type_hint(int)))

or

__kernel void __attribute__((vec_type_hint(int4)))

or some floatn, as you want.
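Applied to the float4 version of the kernel in the question, the hint would look roughly like this (using the attribute placement shown in the spec; whether the compiler exploits the hint is implementation-defined):

__kernel __attribute__((vec_type_hint(float4)))
void columnSum(__global float* src, __global float* dst,
               int srcCols, int srcRows, int srcStep, int dstStep)
{
    // ... body as in the question ...
}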

Read more: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/functionQualifiers.html