This is going to be the very first SO Question I'm posting!
std::cout << "Hello mighty StackOverflow!" << std::endl;
I'm trying to optimize a "Block Matching" implementation for a stereo-vision application using Intel's SSE4.2 and/or AVX intrinsics. I'm using the "Sum of Absolute Differences" (SAD) to find the best matching block. In my case blockSize will be an odd number, such as 3 or 5. This is a snippet of my C++ code:
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            minS = INT_MAX;
            // try every candidate disparity k in [0, beta]
            for (int k = 0; k <= beta; ++k) {
                S = 0;
                // SAD over a blockSize x blockSize window anchored at (i, j)
                for (int l = i; l < i + blockSize; ++l) {
                    for (int m = j; m < j + blockSize; ++m) {
                        // adiff(a,b) === abs(a-b)
                        S += adiff(rImage.at<uchar>(l, m), lImage.at<uchar>(l, m + k));
                    }
                }
                if (S < minS) {
                    minS = S;
                    kStar = k;
                }
            }
            disparity.at<uchar>(i, j) = kStar;
        }
    }
I know that the Streaming SIMD Extensions contain several instructions that facilitate block matching with SAD, exposed through intrinsics such as _mm_mpsadbw_epu8 and _mm_sad_epu8, but they all seem to target block sizes of 4, 16 or 32 (e.g. this code from Intel). My problem is that in my application blockSize is an odd number, mostly 3 or 5.
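For context, here is a minimal sketch of how I understand _mm_sad_epu8 (SSE2): it produces one partial sum per 8-byte half of the register, which is why it maps so naturally onto widths of 8 or 16. The helper name sad16 is my own, not a library function.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: SAD of 16 unsigned bytes at a and b.
// Both pointers must have at least 16 readable bytes.
inline int sad16(const std::uint8_t* a, const std::uint8_t* b) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // Low 64-bit lane: SAD of bytes 0..7; high 64-bit lane: SAD of bytes 8..15.
    const __m128i sad = _mm_sad_epu8(va, vb);
    // Move the high lane down and add the two partial sums.
    const __m128i hi = _mm_srli_si128(sad, 8);
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```

So the instruction itself always sums 8 bytes per lane; there is no variant that stops after 3 or 5.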
I have considered the following starting point:
r0 = _mm_lddqu_si128 ((__m128i*)&rImage.at<uchar>(i, j));
l0 = _mm_lddqu_si128 ((__m128i*)&lImage.at<uchar>(i, j));
s0 = _mm_abs_epi8 (_mm_sub_epi8 (r0 , l0) );
but from here, I don't know of a means to sum up 3 or 5 consecutive bytes from s0!
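One thing I noticed while writing this: _mm_abs_epi8(_mm_sub_epi8(r0, l0)) treats the bytes as signed, so a difference such as 200 - 10 wraps around, whereas _mm_sad_epu8 works on unsigned bytes directly. A direction I am considering, as a sketch only: load 8 bytes per row, zero everything past the first blockSize bytes with a mask, and let _mm_sad_epu8 do the summing. The names sadFirstN, blockSad and maskTable are hypothetical, and I am assuming each row has at least 8 readable bytes from the block's start (which would need padding at the right image border).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: SAD of the first n bytes (0 <= n <= 8) at a and b.
// Assumption: both pointers have at least 8 readable bytes.
inline int sadFirstN(const std::uint8_t* a, const std::uint8_t* b, int n) {
    assert(n >= 0 && n <= 8);
    // Loading 8 bytes at offset (8 - n) yields n bytes of 0xFF followed by zeros.
    static const std::uint8_t maskTable[16] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0, 0, 0, 0, 0, 0, 0, 0};
    const __m128i mask =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(maskTable + 8 - n));
    // 8-byte loads; the upper 64 bits of each register are zeroed.
    const __m128i va = _mm_and_si128(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(a)), mask);
    const __m128i vb = _mm_and_si128(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b)), mask);
    // Masked-out bytes are 0 in both inputs, so they add nothing to the sum.
    return _mm_cvtsi128_si32(_mm_sad_epu8(va, vb));
}

// Hypothetical helper: SAD of a blockSize x blockSize window (blockSize <= 8).
// stride is the distance in bytes between consecutive image rows.
inline int blockSad(const std::uint8_t* r, const std::uint8_t* l,
                    std::ptrdiff_t stride, int blockSize) {
    int s = 0;
    for (int row = 0; row < blockSize; ++row) {
        s += sadFirstN(r + row * stride, l + row * stride, blockSize);
    }
    return s;
}
```

In the loop from my scalar code this would be called with pointers to rImage(i, j) and lImage(i, j + k). It still handles only one block per call, so I am not sure it is the best use of the 128-bit registers.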
I would appreciate any thoughts on this.