7
votes

I'm trying to optimize my code using Neon intrinsics. I have a 24-bit rotation over a 128-bit array (8 each uint16_t).

Here is my c code:

uint16_t rotated[8];
uint16_t temp[8];
uint16_t j;
for(j = 0; j < 8; j++)
{
     //Rotation <<< 24  over 128 bits (x << shift) | (x >> (16 - shift)
     rotated[j] = ((temp[(j+1) % 8] << 8) & 0xffff) | ((temp[(j+2) % 8] >> 8) & 0x00ff);
}

I've checked the gcc documentation about Neon Intrinsics and it doesn't have instruction for vector rotations. Moreover, I've tried to do this using vshlq_n_u16(temp, 8) but all the bits shifted outside a uint16_t word are lost.

How to achieve this using neon intrinsics ? By the way is there a better documentation about GCC Neon Intrinsics ?

3
armcc has __ror intrinsicouah
What about using inline assembly with the ROR ARM instruction?ouah
I prefer to avoid assembly. By the way I'm using GCC so no armcc !Kami
GCC also supports ARM assembly.rsaxvc

3 Answers

6
votes

After some reading on Arm Community Blogs, I've found this :

Neon Arm Bitwise Rotation

VEXT: Extract VEXT extracts a new vector of bytes from a pair of existing vectors. The bytes in the new vector are from the top of the first operand, and the bottom of the second operand. This allows you to produce a new vector containing elements that straddle a pair of existing vectors. VEXT can be used to implement a moving window on data from two vectors, useful in FIR filters. For permutation, it can also be used to simulate a byte-wise rotate operation, when using the same vector for both input operands.

The following Neon GCC Intrinsic does the same as the assembly provided in the picture :

uint16x8_t vextq_u16 (uint16x8_t, uint16x8_t, const int)

So the the 24bit rotation over a full 128bit vector (not over each element) could be done by the following:

uint16x8_t input;
uint16x8_t t0;
uint16x8_t t1;
uint16x8_t rotated;

t0 = vextq_u16(input, input, 1);
t0 = vshlq_n_u16(t0, 8);
t1 = vextq_u16(input, input, 2);
t1 = vshrq_n_u16(t1, 8);
rotated = vorrq_u16(t0, t1);
4
votes

I'm not 100% sure but I don't think NEON has rotate instructions.

You can compose the rotation operation you require with a left shift, a right shit and an or, e.g.:

uint8_t ror(uint8_t in, int rotation)
{
    return (in >> rotation) | (in << (8-rotation));
}

Just do the same with the Neon intrinsics for left shift, right shit and or.

uint16x8_t temp;
uint8_t rot;

uint16x8_t rotated =  vorrq_u16 ( vshlq_n_u16(temp, rot) , vshrq_n_u16(temp, 16 - rot) );

See http://en.wikipedia.org/wiki/Circular_shift "Implementing circular shifts."

This will rotate the values inside the lanes. If you want to rotate the lanes themselves use VEXT as described in the other answer.

3
votes

Use vext.8 to concat a vector with itself and give you the 16-byte window that you want (in this case offset by 3 bytes).

Doing this with intrinsics requires casting to keep the compiler happy, but it's still a single instruction:

#include <arm_neon.h>

uint16x8_t byterotate3(uint16x8_t input) {
    uint8x16_t tmp = vreinterpretq_u8_u16(input);
    uint8x16_t rotated = vextq_u8(tmp, tmp, 16-3);
    return vreinterpretq_u16_u8(rotated);
}

g++5.4 -O3 -march=armv7-a -mfloat-abi=hard -mfpu=neon (on Godbolt) compiles it to this:

byterotate3(__simd128_uint16_t):
    vext.8  q0, q0, q0, #13
    bx      lr

A count of 16-3 means we left-rotate by 3 bytes. (It means we take 13 bytes from the left vector and 3 bytes from the right vector, so it's also a right-rotate by 13).


Related: x86 also has instruction that takes a sliding window into the concatenation of two registers: palignr (added in SSSE3).


Maybe I'm missing something about NEON, but I don't understand why the OP's self-answer is using vext.16 (vextq_u16), which has 16-bit granularity. It's not even a different instruction, just an alias for vext.8 which makes it impossible to use an odd-numbered count, requiring extra instructions. The manual for vext.8 says:

VEXT pseudo-instruction

You can specify a datatype of 16, 32, or 64 instead of 8. In this case, #imm refers to halfwords, words, or doublewords instead of referring to bytes, and the permitted ranges are correspondingly reduced.