Converting between SSE and NEON Intrinsics-Shuffling

Question

I am trying to convert code written with SSE3 intrinsics to NEON SIMD and am stuck on a shuffle function. I have looked at the GCC intrinsics, the ARM manuals, and other forums, but have not been able to find a solution.

CODE:

__m128i upper = _mm_loadu_si128((__m128i*)p1);

register __m128i mask1 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1);
register __m128i mask2 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1,0x80);
__m128i temp1_upper = _mm_or_si128(_mm_shuffle_epi8(upper,mask1),_mm_shuffle_epi8(upper,mask2));

Although vtbl1_u8(uint8x8_t, uint8x8_t) performs a table lookup that can be used to assign values to a destination register, it only operates on 64-bit registers. Also, the shuffle operation starts with a comparison on each mask byte, which has to be reproduced in NEON, and I do not know how to do that efficiently.

r0 = (mask0 & 0x80) ? 0 : SELECT(a, mask0 & 0x0f) // SELECT(a,n) extracts nth 8-bit parameter from a.

r1 = (mask1 & 0x80) ? 0 : SELECT(a, mask1 & 0x0f)

...
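For reference, the per-byte rule above can be written as a small scalar model in C. This is only a sketch of the behaviour described in the pseudocode, not Intel's implementation; the name pshufb_ref is made up for illustration, and the vectors are represented as plain 16-byte arrays with byte 0 first.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the shuffle rule (hypothetical helper, not a real API).
 * For each of the 16 lanes: if bit 7 of the mask byte is set, the result
 * byte is 0; otherwise the low 4 bits of the mask byte select a source byte.
 * r must not overlap a. */
static void pshufb_ref(const uint8_t a[16], const uint8_t mask[16], uint8_t r[16])
{
    assert(a && mask && r);
    for (int i = 0; i < 16; ++i)
        r[i] = (mask[i] & 0x80) ? 0 : a[mask[i] & 0x0F];
}
```

Note that _mm_set_epi8 takes its arguments from byte 15 down to byte 0, so the last argument of mask1 (the value 1) is mask byte 0 in this array layout.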

I cannot find an instruction that first checks the high bit of the mask and then selects by the lower 4 bits of the mask. I know that we can test that bit in each byte of the register and then use the lower 4 bits when it is clear, but I was hoping to do it efficiently. I hope someone can help or provide a reference.

Thanks a lot,

Cheers!

Jake 'Alquimista' LEE · Accepted Answer · 2011-11-01T09:56:29

VTBL returns 0 when the index is out of range.

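Since it supports up to four D registers (the equivalent of two Q registers) as the lookup table, it is quite simple:

load the lookup table into a Q register (Q8 for example, which is d16 and d17)
vtbl.8 d0, {d16, d17}, d0 (where d0 contains the first 8 bytes of your mask; repeat with the other D register for the remaining 8 bytes)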

That will do the trick.

If you want bits 4 to 6 of the mask to stay out of the way, you can clear them prior to vtbl.

Unfortunately, the immediate form of VBIC does not support 8-bit elements.

Therefore, you have to sacrifice a register initialized as the bit mask operand.

vmov.i8 d1, #0x70
load the lookup table into a Q register (Q8 for example, which is d16 and d17)
vbic.i8 d0, d0, d1
vtbl.8 d0, {d16, d17}, d0 (where d0 contains your mask)
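
The same steps can be sketched with intrinsics instead of assembly. This is a minimal sketch, assuming a 16-byte source and a 16-byte mask stored in memory with byte 0 first; the function name shuffle16 is hypothetical. On targets without NEON the code falls back to a scalar loop that models the same two steps (clear bits 4 to 6, then look up with out-of-range indices giving 0), so it can be checked on any machine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* NEON path: one 16-byte table, two 8-byte VTBL lookups (low and high half). */
static void shuffle16(const uint8_t src[16], const uint8_t mask[16], uint8_t dst[16])
{
    assert(src && mask && dst);
    uint8x16_t table = vld1q_u8(src);
    uint8x16_t idx   = vld1q_u8(mask);

    /* Clear bits 4..6 so only bit 7 and the low 4 bits remain. */
    idx = vbicq_u8(idx, vdupq_n_u8(0x70));

    /* vtbl2_u8 takes the 16-byte table as a pair of D registers. */
    uint8x8x2_t tbl;
    tbl.val[0] = vget_low_u8(table);
    tbl.val[1] = vget_high_u8(table);

    /* Indices >= 16 (bit 7 set) are out of range, so VTBL writes 0. */
    uint8x8_t lo = vtbl2_u8(tbl, vget_low_u8(idx));
    uint8x8_t hi = vtbl2_u8(tbl, vget_high_u8(idx));
    vst1q_u8(dst, vcombine_u8(lo, hi));
}
#else
/* Scalar fallback modelling the same steps; dst must not overlap src. */
static void shuffle16(const uint8_t src[16], const uint8_t mask[16], uint8_t dst[16])
{
    assert(src && mask && dst);
    for (int i = 0; i < 16; ++i) {
        uint8_t idx = mask[i] & (uint8_t)~0x70; /* the VBIC step */
        dst[i] = (idx < 16) ? src[idx] : 0;     /* the VTBL step */
    }
}
#endif
```

The result of shuffle16 with mask1 and with mask2 can then be combined with a bitwise OR (vorrq_u8 on NEON), mirroring the _mm_or_si128 call in the question.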
