I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (8 each uint16_t
).
Here is my c code:
uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for(j = 0; j < 8; j++)
{
//Rotation <<< 24 over 128 bits (x << shift) | (x >> (16 - shift)
rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}
I've checked the gcc documentation about Neon Intrinsics and it doesn't have instruction for vector rotations. Moreover, I've tried to do this using vshlq_n_u16(temp, 8)
but all the bits shifted outside a uint16_t
word are lost.
How to achieve this using neon intrinsics ? By the way is there a better documentation about GCC Neon Intrinsics ?
armcc
has__ror
intrinsic – ouahROR
ARM instruction? – ouah