A problem with the code that you had is that casting between vector types is not a value-preserving conversion, it is a reinterpretation: (__m256d)mm_a actually means "take these 32 bytes and interpret them as 4 doubles". That can be OK, but if the data is packed RGB888 then reinterpreting it as doubles is meaningless. Proper conversions exist, but using floating point arithmetic (especially double precision) for this is overkill anyway. Using smaller types lets more of them fit in a vector, which is usually faster because each instruction then works on more items.
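To make the distinction concrete, here is a minimal sketch (not from your code, variable names made up) of a cast versus an actual conversion:

__m256i packed_ints = _mm256_set1_epi32(1);
// cast (what (__m256d)x does): the same 32 bytes read as 4 doubles, giving meaningless tiny values
__m256d reinterpreted = _mm256_castsi256_pd(packed_ints);
// conversion: 4 int32 values actually converted, giving 1.0, 1.0, 1.0, 1.0
__m256d converted = _mm256_cvtepi32_pd(_mm256_castsi256_si128(packed_ints));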
Also, the 122-byte header should not be put into the aligned arrays; its presence there immediately misaligns the actual pixel data. It can be written to the output file separately.
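For example, the file handling could look roughly like this (a sketch only: the stream names are made up, the 122-byte header size is the one mentioned above, and error checking is omitted):

unsigned char header[122];
fread(header, 1, sizeof header, infile_a);   // the header goes into its own small buffer
fread(a, 1, ARRSIZE, infile_a);              // so a[0] is the first byte of pixel data
fseek(infile_b, sizeof header, SEEK_SET);    // skip the second image's header
fread(b, 1, ARRSIZE, infile_b);
// ... blend a and b into c ...
fwrite(header, 1, sizeof header, outfile);   // write the header on its own
fwrite(c, 1, ARRSIZE, outfile);              // followed by the blended pixel data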
For example, one strategy for this is to widen the data to 16-bit, use _mm256_mulhi_epu16 to scale by approximately 80% and 20%, add the two results with _mm256_add_epi16, then narrow back to 8-bit. The unpacking to 16-bit and the later packing back to 8-bit work a bit strangely with 256-bit vectors: think of them as twice the 128-bit operation, side by side. To prevent premature truncation, the 8-bit source data can be unpacked with a free shift left by 8, putting each data byte in the high byte of its 16-bit word. That way the multiply-high produces 16-bit intermediate results instead of truncating them to 8 bits immediately, so the rounding can be done after the addition, which is more proper (this does cost an extra shift at the end, and optionally an add for the rounding). For example like this (not tested):
const uint16_t scale_a = uint16_t(0x10000 * 0.8);      // 0.8 as a 16-bit fraction
const uint16_t scale_b = uint16_t(0x10000 - scale_a);  // the remainder, so the weights sum to 1.0
__m256i roundoffset = _mm256_set1_epi16(0x80);         // 0.5 in 8.8 fixed point, for rounding
__m256i zero = _mm256_setzero_si256();
for (int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = ((a[i] * 0.8) + (b[i] * 0.2)), computed as
    // c[i] = (mulhi(a[i] << 8, scale_a) + mulhi(b[i] << 8, scale_b) + 0x80) >> 8;
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&a[i]);
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&b[i]);
    // unpack with zero in the low byte: widen to 16-bit with a free shift left by 8
    __m256i data_al = _mm256_unpacklo_epi8(zero, raw_a);
    __m256i data_bl = _mm256_unpacklo_epi8(zero, raw_b);
    __m256i data_ah = _mm256_unpackhi_epi8(zero, raw_a);
    __m256i data_bh = _mm256_unpackhi_epi8(zero, raw_b);
    // multiply-high keeps the top 16 bits of the 32-bit product: 8.8 fixed-point results
    __m256i scaled_al = _mm256_mulhi_epu16(data_al, _mm256_set1_epi16(scale_a));
    __m256i scaled_bl = _mm256_mulhi_epu16(data_bl, _mm256_set1_epi16(scale_b));
    __m256i scaled_ah = _mm256_mulhi_epu16(data_ah, _mm256_set1_epi16(scale_a));
    __m256i scaled_bh = _mm256_mulhi_epu16(data_bh, _mm256_set1_epi16(scale_b));
    __m256i suml = _mm256_add_epi16(scaled_al, scaled_bl);
    __m256i sumh = _mm256_add_epi16(scaled_ah, scaled_bh);
    // round and drop the 8 fraction bits
    __m256i roundedl = _mm256_srli_epi16(_mm256_add_epi16(suml, roundoffset), 8);
    __m256i roundedh = _mm256_srli_epi16(_mm256_add_epi16(sumh, roundoffset), 8);
    // pack back to bytes; packus undoes the lane-wise interleave of the unpacks
    __m256i packed = _mm256_packus_epi16(roundedl, roundedh);
    _mm256_storeu_si256((__m256i *)&c[i], packed);
}
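Per output byte, the loop above boils down to the following scalar computation (just a sketch to show the fixed-point math, using the same constants as above):

uint8_t blend_byte(uint8_t a, uint8_t b) {
    const uint16_t scale_a = uint16_t(0x10000 * 0.8);
    const uint16_t scale_b = uint16_t(0x10000 - scale_a);
    uint32_t wide_a = uint32_t(a) << 8;                      // the "free shift left by 8" from the unpack
    uint32_t wide_b = uint32_t(b) << 8;
    uint16_t scaled_a = uint16_t((wide_a * scale_a) >> 16);  // mulhi: high 16 bits of the product, roughly a * 0.8 in 8.8 fixed point
    uint16_t scaled_b = uint16_t((wide_b * scale_b) >> 16);  // roughly b * 0.2 in 8.8 fixed point
    return uint8_t((scaled_a + scaled_b + 0x80) >> 8);       // add, round, drop the fraction bits
}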
There are quite a lot of shuffle operations in it, which limit the throughput to one iteration every 5 cycles (in the absence of other limiters); at 32 output bytes (about 10.7 RGB888 pixels) per iteration, that is roughly 2 pixels per cycle.
A different strategy could be to use _mm256_maddubs_epi16 with a lower-precision approximation of the blend factors. It treats its second operand as signed bytes and does signed saturation, so this time only a 7-bit approximation of the scales fits. Since it operates on 8-bit data there is less unpacking, though some remains because the data from both images has to be interleaved. Maybe like this (also not tested):
const uint8_t scale_a = uint8_t(0x80 * 0.8);      // 0.8 as a 7-bit fraction (102/128)
const uint8_t scale_b = uint8_t(0x80 - scale_a);  // the remainder, so the weights sum to 128
__m256i scale = _mm256_set1_epi16((scale_b << 8) | scale_a);  // byte order matches the interleaved data below
__m256i roundoffset = _mm256_set1_epi16(0x40);    // half of 1 << 7, so the shift below rounds to nearest
for (int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = ((a[i] * 0.8) + (b[i] * 0.2)), computed as
    // c[i] = (a[i] * scale_a + b[i] * scale_b + 0x40) >> 7;
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&a[i]);
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&b[i]);
    // interleave the images so each 16-bit lane holds one byte of a and one byte of b
    __m256i data_l = _mm256_unpacklo_epi8(raw_a, raw_b);
    __m256i data_h = _mm256_unpackhi_epi8(raw_a, raw_b);
    // a[i] * scale_a + b[i] * scale_b as 16-bit results
    __m256i blended_l = _mm256_maddubs_epi16(data_l, scale);
    __m256i blended_h = _mm256_maddubs_epi16(data_h, scale);
    // round and drop the 7 fraction bits
    __m256i roundedl = _mm256_srli_epi16(_mm256_add_epi16(blended_l, roundoffset), 7);
    __m256i roundedh = _mm256_srli_epi16(_mm256_add_epi16(blended_h, roundoffset), 7);
    __m256i packed = _mm256_packus_epi16(roundedl, roundedh);
    _mm256_storeu_si256((__m256i *)&c[i], packed);
}
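One thing worth noting about _mm256_maddubs_epi16 is that it saturates each a[i] * scale_a + b[i] * scale_b pair to a signed 16-bit result. With these scales (scale_a = 102, scale_b = 26) that never triggers, which a compile-time check can document:

// worst case is both inputs at 255: 255*102 + 255*26 = 32640, which still fits in an int16_t
static_assert(255 * 102 + 255 * 26 <= 32767,
              "7-bit scales summing to 128 cannot make maddubs saturate");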
With only 3 shuffles, perhaps the throughput could reach 1 iteration per 3 cycles, which would be about 3.5 pixels per cycle.
Hopefully there are better ways to do it. Neither of these approaches comes close to saturating the multipliers, which seems like it should be the goal, but I don't know how to get there.
Another strategy is to use several rounds of averaging to get close to the desired ratio, although the closest convenient ratio, 13/16 : 3/16, is not all that close to 80 : 20. Maybe something like this (not tested):
for (int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = round_somehow((a[i] * 0.8125) + (b[i] * 0.1875));
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&a[i]);
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&b[i]);
    __m256i mixed_8_8 = _mm256_avg_epu8(raw_a, raw_b);            // ( 8*a +  8*b) / 16
    __m256i mixed_12_4 = _mm256_avg_epu8(raw_a, mixed_8_8);       // (12*a +  4*b) / 16
    __m256i mixed_14_2 = _mm256_avg_epu8(raw_a, mixed_12_4);      // (14*a +  2*b) / 16
    __m256i mixed_13_3 = _mm256_avg_epu8(mixed_12_4, mixed_14_2); // (13*a +  3*b) / 16
    _mm256_storeu_si256((__m256i *)&c[i], mixed_13_3);
}
But _mm256_avg_epu8 rounds up, and stacking it so many times lets that upward bias accumulate. There is no "average, rounding down" instruction, but avg_down(a, b) == ~avg_up(~a, ~b) (for bytes ~x is 255 - x, and substituting that into avg_up and simplifying leaves (a + b) / 2 rounded down). That does not result in a huge mess of complements, because most of them cancel each other. If some rounding up is kept, it makes sense to leave it for the last operation; always rounding down saves a XOR though. Maybe something like this (not tested):
__m256i ones = _mm256_set1_epi8(-1);   // all-ones mask, used for complementing
for (int i = 0; i < ARRSIZE; i += 32) {
    // c[i] = round_somehow((a[i] * 0.8125) + (b[i] * 0.1875));
    __m256i raw_a = _mm256_loadu_si256((__m256i *)&a[i]);
    __m256i raw_b = _mm256_loadu_si256((__m256i *)&b[i]);
    // work on the complements so that avg_epu8 effectively rounds down
    __m256i inv_a = _mm256_xor_si256(ones, raw_a);
    __m256i inv_b = _mm256_xor_si256(ones, raw_b);
    __m256i mixed_8_8 = _mm256_avg_epu8(inv_a, inv_b);        // ~(( 8*a +  8*b) / 16)
    __m256i mixed_12_4 = _mm256_avg_epu8(inv_a, mixed_8_8);   // ~((12*a +  4*b) / 16)
    __m256i mixed_14_2 = _mm256_avg_epu8(inv_a, mixed_12_4);  // ~((14*a +  2*b) / 16)
    // complement back and do the final average in the normal domain, rounding up
    __m256i mixed_13_3 = _mm256_avg_epu8(_mm256_xor_si256(mixed_12_4, ones),
                                         _mm256_xor_si256(mixed_14_2, ones));
    _mm256_storeu_si256((__m256i *)&c[i], mixed_13_3);
}
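For completeness, the "always round down" variant mentioned above saves one XOR by staying in the complemented domain until the very end (also not tested):

__m256i ones = _mm256_set1_epi8(-1);
for (int i = 0; i < ARRSIZE; i += 32) {
    __m256i inv_a = _mm256_xor_si256(ones, _mm256_loadu_si256((__m256i *)&a[i]));
    __m256i inv_b = _mm256_xor_si256(ones, _mm256_loadu_si256((__m256i *)&b[i]));
    __m256i mixed_8_8 = _mm256_avg_epu8(inv_a, inv_b);           // every step now rounds down
    __m256i mixed_12_4 = _mm256_avg_epu8(inv_a, mixed_8_8);      // (seen from the normal domain)
    __m256i mixed_14_2 = _mm256_avg_epu8(inv_a, mixed_12_4);
    __m256i mixed_13_3 = _mm256_avg_epu8(mixed_12_4, mixed_14_2);
    // one complement at the end flips the result back to the normal domain
    _mm256_storeu_si256((__m256i *)&c[i], _mm256_xor_si256(mixed_13_3, ones));
}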