Performance degrade while using alternative for Intel intrinsics SSSE3

Question

I am developing a performance-critical application that has to be ported to an Intel Atom processor, which supports only MMX, SSE, SSE2 and SSE3. The previous version of the application used SSSE3 as well as AVX; now I want to downgrade it to the instruction sets available on the Atom (MMX, SSE, SSE2, SSE3).

There is a serious performance drop when I replace the SSSE3 instructions, particularly _mm_hadd_epi16, with this code:

RegTemp1 = _mm_setr_epi16(RegtempRes1.m128i_i16[0], RegtempRes1.m128i_i16[2], 
                          RegtempRes1.m128i_i16[4], RegtempRes1.m128i_i16[6],
                          Regfilter.m128i_i16[0],   Regfilter.m128i_i16[2],
                          Regfilter.m128i_i16[4],   Regfilter.m128i_i16[6]);

RegTemp2 = _mm_setr_epi16(RegtempRes1.m128i_i16[1], RegtempRes1.m128i_i16[3],
                          RegtempRes1.m128i_i16[5], RegtempRes1.m128i_i16[7],
                          Regfilter.m128i_i16[1],   Regfilter.m128i_i16[3],
                          Regfilter.m128i_i16[5], Regfilter.m128i_i16[7]);

RegtempRes1 = _mm_add_epi16(RegTemp1, RegTemp2);

This is the best conversion I was able to come up with for this particular instruction, but the change has seriously degraded the performance of the entire program.

Can anyone please suggest a more efficient alternative to the _mm_hadd_epi16 instruction that uses only MMX, SSE, SSE2 and SSE3 instructions? Thanks in advance.

The Intel Atom processor I am using does not support SSSE3 or higher instruction sets. So, I want my application to support just SSE, SSE2 and SSE3 instruction sets. — Harrisson
@Harrisson, maybe you're just having trouble enabling SSSE3 for Atom in your compiler? Have you searched for this? Here is a discussion where someone had a problem getting SSSE3 working with GCC on Atom: forum.serviio.org/viewtopic.php?f=14&t=6931 — Z boson
@Harrisson, if you want an official confirmation, ask on Intel Software Forums. But I'm sure all Atoms support SSSE3: gcc will enable SSSE3 if you specify -march=atom and the option to enable code-generation for Atom in Intel compiler is named -xATOM_SSSE3. Bay Trail is based on newer Silvermont microarchitecture and additionally supports SSE4.2. — Marat Dukhan
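One way to settle the question raised in the comments above is to ask the CPU directly. The following is a minimal sketch, not something from the original posts: the helper name cpu_has_ssse3 is made up for illustration, and it assumes an x86 target built with either MSVC (__cpuid from <intrin.h>) or GCC/Clang (__get_cpuid from <cpuid.h>). CPUID leaf 1 reports SSSE3 in bit 9 of ECX.

```c
#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid */
#else
#include <cpuid.h>    /* __get_cpuid (GCC/Clang) */
#endif

/* Hypothetical helper: returns 1 if the CPU reports SSSE3, 0 otherwise.
   CPUID leaf 1 sets ECX bit 9 when SSSE3 is available. */
static int cpu_has_ssse3(void)
{
#if defined(_MSC_VER)
    int regs[4];                       /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    return (regs[2] >> 9) & 1;
#else
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* leaf 1 not supported */
    return (int)((ecx >> 9) & 1);
#endif
}
```

If this returns 1 on the target Atom, the original _mm_hadd_epi16 code path can probably stay, and only the AVX parts need replacing.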

Z boson · Accepted Answer · 2014-02-21T11:03:09

If your goal is to take the horizontal sum of eight 16-bit values, you can do it with SSE2 like this:

__m128i sum1  = _mm_shuffle_epi32(a,0x0E);             // 4 high elements
__m128i sum2  = _mm_add_epi16(a,sum1);                 // 4 sums
__m128i sum3  = _mm_shuffle_epi32(sum2,0x01);          // 2 high elements
__m128i sum4  = _mm_add_epi16(sum2,sum3);              // 2 sums
__m128i sum5  = _mm_shufflelo_epi16(sum4,0x01);        // 1 high element
__m128i sum6  = _mm_add_epi16(sum4,sum5);              // 1 sum
int16_t sum7  = (int16_t)_mm_cvtsi128_si32(sum6);      // low 16 bits hold the sum
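If what you need is a drop-in replacement for _mm_hadd_epi16 itself (pairwise sums of two registers, as in the question) rather than a full horizontal sum, something along these lines may work with SSE2 only. This is a sketch, not a tuned implementation: the name hadd_epi16_sse2 is made up for illustration, and I have not measured it on an Atom. It keeps everything in registers instead of extracting individual elements, and it wraps on overflow like _mm_hadd_epi16 does.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (not a library function): SSE2-only emulation of
   _mm_hadd_epi16(a, b). */
static inline __m128i hadd_epi16_sse2(__m128i a, __m128i b)
{
    /* Move each odd element down next to its even neighbour and add.
       Afterwards the low 16 bits of every 32-bit lane hold
       x[2i] + x[2i+1], with wraparound. */
    __m128i sa = _mm_add_epi16(a, _mm_srli_epi32(a, 16));
    __m128i sb = _mm_add_epi16(b, _mm_srli_epi32(b, 16));

    /* Sign-extend those low 16 bits to 32 bits so that the saturating
       pack below never clips a value. */
    sa = _mm_srai_epi32(_mm_slli_epi32(sa, 16), 16);
    sb = _mm_srai_epi32(_mm_slli_epi32(sb, 16), 16);

    /* Narrow back to 16 bits: sums from a land in elements 0..3,
       sums from b in elements 4..7. */
    return _mm_packs_epi32(sa, sb);
}
```

With the variables from the question, the replacement would then read RegtempRes1 = hadd_epi16_sse2(RegtempRes1, Regfilter);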
