Block Matching optimization using x86/x64 Streaming SIMD Extension

Question

This is going to be the very first SO Question I'm posting!

    std::cout << "Hello mighty StackOverflow!" << std::endl;

I'm trying to optimize a "Block Matching" implementation for stereo-vision application using Intel's SSE4.2 and/or AVX intrinsics. I'm using "Sum of Absolute Differences" to find the best matching block. In my case blockSize will be an odd number, such as 3 or 5. This a snippet of my C++ code:

    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            minS = INT_MAX;
            for (int k = 0; k <= beta; ++k) {
                S = 0;
                for (int l = i; l < i + blockSize; ++l) {
                    for (int m = j; m <= j + blockSize ; ++m) {
                        // adiff(a,b) === abs(a-b)
                        S += adiff(rImage.at<uchar>(l, m), lImage.at<uchar>(l, m + k));
                    }
                }
                if (S < minS) {
                    minS = S;
                    kStar = k;
                }
            }
            disparity.at<uchar>(i, j) = kStar;
        }
    }

I know that the Streaming SIMD Extension contain many instructions to facilitate block-matching using SAD such as _mm_mpsadbw_epu8 and _mm_sad_epu8 , but they all seam to be targeting blockSizes that are 4, 16 or 32. e.g. this code from Intel. My problem is that in my application blockSize is an odd number, mostly 3 or 5.

I have considered the following starting point:

            r0 = _mm_lddqu_si128 ((__m128i*)&rImage.at<uchar>(i, j));
            l0 = _mm_lddqu_si128 ((__m128i*)&lImage.at<uchar>(i, j));
            s0 = _mm_abs_epi8 (_mm_sub_epi8 (r0 , l0) );

but from here, I don't know of a means to sum up 3 or 5 consecutive bytes from s0!

I would appreciate any thoughts on this.

Welcome to Stack Overflow! That's a well-composed question for a new member. I suspect you'll get a good answer shortly. :) — Drew Dormann
I suspect the best answer will be to simply pad your rows/columns with zeros (so that those entries contribute nothing to your final sum). — twalberg
The sum will be over 3x3 or 5x5 blocks -- That's 1+8 and 1+24 entries all together. The 8 and 24 OTOH are figures that could be handled efficiently with SIMD. — Aki Suihkonen
Yes @AkiSuihkonen. I assume a previous leveling has been run on the data. — ɹɐʎɯɐʞ

Mats Petersson Mats Petersson · Accepted Answer · 2013-04-11T16:27:45

I suspect if blocksize is as small as 3-5 bytes x 3-5 bytes, you'd get fairly little benefit from using SSE or similar instructions, because you'll spend far too much of the "gain" from doing the math quickly in "swizzling" (moving data from one place to another).

However, looking at the code, it looks like you are processing the same rImage[i, j] multiple times, which I think doesn't make sense.

Block Matching optimization using x86/x64 Streaming SIMD Extension

1 Answers