2
votes

I want to speed up image processing code using OpenMP, and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result. I'm not sure why the code with OpenMP is much slower in some situations than in others. For example, the function divideImageDataWithParam(), or the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which uses the struct TSize as the parameter for the image data size. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why isn't it with a different radius size in the program?

I created a GitHub repository for this little testing project: https://github.com/Tb45/OpenMP-Strange-Behavior

Here are all results summarized: https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt

I didn't find any explanation of why this is happening or what I am doing wrong. Thanks for your help.

I'm working on a faster box filter and other image processing algorithms.

#include <cstdint>   // for intptr_t

typedef intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};



void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}


void divideImageDataWithParam_OpenMP(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}
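
A minimal sketch of the kind of timing loop I mean (not the exact code from the repository; here the image is just filled with a constant instead of using generateRandomImageData):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    TSize size = { 3840, 2160 };
    std::vector<unsigned char> src(size.width * size.height, 128);
    std::vector<unsigned char> dst(size.width * size.height);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++)   // numberOfIterations = 100
        divideImageDataWithParam_OpenMP(src.data(), size.width, dst.data(), size.width, size, 3, true);
    auto t1 = std::chrono::high_resolution_clock::now();

    std::printf("%.3f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}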

Results of divideImageDataWithParam():

generateRandomImageData :: 3840x2160
numberOfIterations = 100

With Visual C++ 2019:

        32bit         64bit
    336.906ms     344.251ms   divideImageDataWithParam
   1832.120ms    6395.861ms   divideImageDataWithParam_OpenMP  single-threaded  parallel=false
    387.152ms    1204.302ms   divideImageDataWithParam_OpenMP  multi-threaded   parallel=true

With Intel C++ 19:

        32bit         64bit
     15.162ms       8.927ms   divideImageDataWithParam
    266.646ms     294.134ms   divideImageDataWithParam_OpenMP  single-threaded  parallel=false
    239.564ms    1195.556ms   divideImageDataWithParam_OpenMP  multi-threaded   parallel=true

A screenshot from Intel VTune Amplifier shows that divideImageDataWithParam_OpenMP() with parallel=false spends most of its time in the mov instruction that stores to dst memory. [screenshot: Intel VTune Amplifier]

1) Remember to build with compiler optimizations enabled. 2) OpenMP (or any kind of parallelizing) is not a silver bullet; there has to be enough parallel work available in your program to offset the overhead of creating multiple threads, synchronizing them and more, otherwise you are just slowing things down. 3) Be aware of things like priority inversion, false sharing, livelocks, deadlocks, race conditions, synchronization and much, much more when attempting to parallelize code; there are many potential pitfalls. 4) Different vendors' OpenMP libraries also differ in performance. – Jesper Juhl
OK, I understand, but how is it possible that this simple code with #pragma omp and parallel set to false takes 20x more time (64-bit) to process compared to the code without #pragma omp? It is a little bit awkward for such a simple task. – Tb45
Did you find that __restrict dst made no difference? How about comparing optimization reports from the Intel compiler, and comparing the Intel and Microsoft OpenMP libraries with the MSVC build? – tim18
With the Intel OpenMP library, I can't imagine that omp_places=cores and reducing the number of threads would not show interesting results. – tim18
The __restrict keyword on dst made no difference. I added the results from the Intel C++ compiler, and it is still slower than the single thread without OpenMP, and even slower with multi-threaded OpenMP. The function divideImageDataWithParam() with param = 1 is really weird. – Tb45

2 Answers

1
votes

648trindade is right; it has to do with optimizations that cannot be done with OpenMP. But it's not loop unrolling or vectorization, it's inlining, which allows for a smart substitution.

Let me explain: integer divisions are incredibly slow (64-bit IDIV: ~40-100 cycles). So whenever possible, people (and compilers) try to avoid divisions. One trick is to substitute the division with a multiplication and a shift. That only works if the divisor is known at compile time. This is the case here because your function divideImageDataWithParam is inlined and param is known. You can verify this by prepending the function with __declspec(noinline); you will then get the timings you expected.
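
A minimal sketch of that check (assuming MSVC; GCC/Clang would use __attribute__((noinline)) instead, and the _noinline name is just for illustration):

__declspec(noinline)   // keep the optimizer from inlining, so param stays a runtime value
void divideImageDataWithParam_noinline(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;   // now compiled as a real IDIV
        }
    }
}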

The OpenMP parallelization does not allow this trick, because the parallel loop body is outlined into a separate function that is not inlined; therefore param is not known at compile time and an expensive IDIV instruction is generated.

Compiler output of divideImageDataWithParam (WIN10, MSVC2017, x64):

0x7ff67d151480  <+  336>         movzx   ecx,byte ptr [r10+r8]
0x7ff67d151485  <+  341>         mov     rax,r12
0x7ff67d151488  <+  344>         mul     rax,rcx                  <------- multiply
0x7ff67d15148b  <+  347>         shr     rdx,3                    <------- shift
0x7ff67d15148f  <+  351>         mov     byte ptr [r8],dl
0x7ff67d151492  <+  354>         lea     r8,[r8+1]
0x7ff67d151496  <+  358>         sub     r9,1
0x7ff67d15149a  <+  362>         jne     test!main+0x150 (00007ff6`7d151480)

And the OpenMP version:

0x7ff67d151210  <+  192>         movzx   eax,byte ptr [r10+rcx]
0x7ff67d151215  <+  197>         lea     rcx,[rcx+1]
0x7ff67d151219  <+  201>         cqo
0x7ff67d15121b  <+  203>         idiv    rax,rbp                  <------- idiv
0x7ff67d15121e  <+  206>         mov     byte ptr [rcx-1],al
0x7ff67d151221  <+  209>         lea     rax,[r8+rcx]
0x7ff67d151225  <+  213>         mov     rdx,qword ptr [rbx]
0x7ff67d151228  <+  216>         cmp     rax,rdx
0x7ff67d15122b  <+  219>         jl      test!divideImageDataWithParam$omp$1+0xc0 (00007ff6`7d151210)

Note 1) If you try the Compiler Explorer (https://godbolt.org/), you will see that some compilers do the substitution for the OpenMP version too.

Note 2) As soon as the parameter is not known at compile time, this optimization cannot be done anyway. So if you put your function into a library, it will be slow. I'd do something like precomputing the division for all possible byte values and then doing a lookup. This is even faster, because the lookup table fits into 4-5 cache lines and L1 latency is only 3-4 cycles.

void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    uint8_t tbl[256];
    for(int i = 0; i < 256; i++) {
        tbl[i] = i / param;
    }

    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
        }
    }
}
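
If you also want the OpenMP threading on top of the lookup table, the table drops straight into the parallel version as well. A sketch along the lines of your original signature (the name and the if(parallel) clause are taken from your code, the rest is mine):

void divideImageDataWithParam_OpenMP_LUT(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    uint8_t tbl[256];                    // 256 bytes, a handful of cache lines
    for (int i = 0; i < 256; i++)
    {
        tbl[i] = (uint8_t)(i / param);   // one slow division per table entry instead of per pixel
    }

    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = tbl[src[y*srcStep + x]];   // read-only table, safe to share across threads
        }
    }
}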

Also thanks for the interesting question, I learned a thing or two along the way! ;-)

0
votes

This behavior is explained by the use of compiler optimizations: when they are enabled, the sequential divideImageDataWithParam code is subjected to a series of optimizations (loop unrolling, vectorization, etc.) that the parallel divideImageDataWithParam_OpenMP code probably is not, because the compiler outlines the parallel region into a separate function and cannot apply the same transformations to it.

If you compile the same code without optimizations, you will find that the runtime of the sequential version is very similar to that of the parallel version with only one thread.

The maximum speedup of the parallel version in this case is limited to dividing the original, unoptimized workload among the threads. Optimizations in this case need to be written manually.
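
One way to write such an optimization by hand is to precompute a 16-bit fixed-point reciprocal of param once and replace the per-pixel division with a multiply and a shift, mirroring what the optimizer does for the inlined sequential version. A sketch (the function name is mine; for 8-bit pixel values and param >= 1 the result matches integer division):

void divideImageDataWithParam_OpenMP_recip(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    // m = ceil(2^16 / param); for x in [0, 255] and param >= 1,
    // (x * m) >> 16 gives the same result as x / param.
    const uint32_t m = (uint32_t)((65536 + param - 1) / param);

    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = (unsigned char)((src[y*srcStep + x] * m) >> 16);
        }
    }
}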