I want to speed up image processing code using OpenMP, and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result.
I'm not sure why the code with OpenMP is much slower in some situations than in others. One example is the function divideImageDataWithParam(); another is the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which uses struct TSize as the parameter for the image data size. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why isn't it different with a different radius size in the program?
I created a GitHub repository for this small test project: https://github.com/Tb45/OpenMP-Strange-Behavior
All results are summarized here: https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt
I couldn't find any explanation of why this is happening or what I am doing wrong. Thanks for your help.
I'm working on a faster box filter and other image processing algorithms.
#include <cstdint> // intptr_t

typedef intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};

// Serial version: divides every pixel of src by param and stores the result in dst.
void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}

// OpenMP version: same loop body; rows are distributed across threads only when parallel is true.
void divideImageDataWithParam_OpenMP(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = src[y*srcStep + x]/param;
        }
    }
}
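For reference, here is a minimal sketch of how a timing like the ones reported below could be taken. It assumes std::chrono::steady_clock, and measureMs is a hypothetical helper written for this illustration; the actual benchmark code is in the repository and may measure differently. The helper returns the total wall-clock time for all iterations.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

typedef std::intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};

// Same loop as divideImageDataWithParam() above, repeated so this block compiles on its own.
void divideImageDataWithParam(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
    assert(param != 0); // division by zero would be undefined behavior
    for (int_t y = 0; y < size.height; y++)
    {
        for (int_t x = 0; x < size.width; x++)
        {
            dst[y*dstStep + x] = static_cast<unsigned char>(src[y*srcStep + x]/param);
        }
    }
}

// Hypothetical helper (not from the repository): calls fn `iterations` times
// and returns the total elapsed wall-clock time in milliseconds.
template <typename Fn>
double measureMs(int iterations, Fn fn)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It could be called as measureMs(100, [&] { divideImageDataWithParam(src, step, dst, step, size, param); }) with buffers of 3840x2160 bytes, so that the serial and OpenMP variants are timed under identical conditions.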
Results of divideImageDataWithParam():
generateRandomImageData :: 3840x2160
numberOfIterations = 100
With Visual C++ 2019:
     32-bit       64-bit
   336.906ms    344.251ms   divideImageDataWithParam
  1832.120ms   6395.861ms   divideImageDataWithParam_OpenMP single-threaded parallel=false
   387.152ms   1204.302ms   divideImageDataWithParam_OpenMP multi-threaded parallel=true
With Intel C++ 19:
     32-bit       64-bit
    15.162ms      8.927ms   divideImageDataWithParam
   266.646ms    294.134ms   divideImageDataWithParam_OpenMP single-threaded parallel=false
   239.564ms   1195.556ms   divideImageDataWithParam_OpenMP multi-threaded parallel=true
Screenshot from Intel VTune Amplifier: divideImageDataWithParam_OpenMP() with parallel=false spends most of its time in the mov instruction that writes to dst memory.
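One experiment that might help narrow this down is sketched here. It rests on an unverified assumption: that the extra cost is related to how the compiler accesses size, param, and the step values inside the OpenMP region, so that copying them into locals and hoisting the row pointers would change the generated code. This is only a guess, not a confirmed explanation, and the function name divideImageDataWithParam_OpenMP_locals is hypothetical (it is not in the repository).

```cpp
#include <cassert>
#include <cstdint>

typedef std::intptr_t int_t;

struct TSize
{
    int_t width;
    int_t height;
};

// Hypothetical variant for comparison: same result as divideImageDataWithParam_OpenMP(),
// but loop bounds, divisor, and row pointers are copied into locals inside the loop body.
void divideImageDataWithParam_OpenMP_locals(
    const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
    assert(param != 0); // division by zero would be undefined behavior
    const int_t height = size.height;

    #pragma omp parallel for if(parallel)
    for (int_t y = 0; y < height; y++)
    {
        // Locals declared inside the loop body are private to each thread.
        const int_t width = size.width;
        const int_t divisor = param;
        const unsigned char * srcRow = src + y*srcStep;
        unsigned char * dstRow = dst + y*dstStep;

        for (int_t x = 0; x < width; x++)
        {
            dstRow[x] = static_cast<unsigned char>(srcRow[x]/divisor);
        }
    }
}
```

If this variant times close to the serial divideImageDataWithParam(), that would point toward variable access inside the parallel region; if it is just as slow, the cause is likely elsewhere.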