NEON Assembly
I am trying to understand ARMv8 NEON. Here is an example of what I am trying to do.
I load 16 bytes (pixels as uchar) from array A. Now I want to do a "lengthening add" to ushort. From the documentation, I see that UADDL and UADDL2 perform a lengthening add on the lower half and upper half of the source registers, respectively. I could write the following code to get it working:
ld1 {V10.16B}, [x0]
uaddl V11.8H, V10.8B, V10.8B
uaddl2 V12.8H, V10.16B, V10.16B
st1 {V11.8H}, [x1], #16
st1 {V12.8H}, [x1], #16
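To make the intended result concrete, here is a scalar C sketch of what I understand those five instructions to compute for one 16-byte block. The function name `widening_add_self_16` is made up for illustration, and it assumes `src` and `dst` each point to at least 16 elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the assembly above for one 16-byte block:
   each 8-bit pixel is widened to 16 bits and added to itself,
   so no result can overflow (maximum is 255 + 255 = 510). */
void widening_add_self_16(const uint8_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);

    /* uaddl: lanes 0..7 (lower half of the 128-bit register) */
    for (size_t i = 0; i < 8; ++i)
        dst[i] = (uint16_t)((uint16_t)src[i] + (uint16_t)src[i]);

    /* uaddl2: lanes 8..15 (upper half of the 128-bit register) */
    for (size_t i = 8; i < 16; ++i)
        dst[i] = (uint16_t)((uint16_t)src[i] + (uint16_t)src[i]);
}
```

The two loops are kept separate only to mirror the UADDL / UADDL2 split; a single loop over all 16 lanes would give the same output.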
NEON Intrinsics
Coming to NEON intrinsics, the syntax is as follows (refer to page 8):
uint16x8_t vaddl_u8 (uint8x8_t a, uint8x8_t b)
uint16x8_t vaddl_high_u8 (uint8x16_t a, uint8x16_t b)
Here, the inputs to the two functions are of different types.
So once I load a uint8x16_t variable, how am I supposed to pass it to vaddl_u8? Is there any cast that I can do? Or do I have to copy the lower half to another variable (which would mean an extra cost)?
So my question is, how can I implement this piece of assembly code with NEON intrinsics?
UPDATE
- I am using aarch64-linux-gnu-g++ (gcc version 5.4.0) in Ubuntu 16.04.
uint8x16_t to uint8x8_t for free, right? With a cast intrinsic, I think. Do that for the low half, and it should compile to the asm you'd hope for. – Peter Cordes
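A minimal sketch of what that suggestion might look like, assuming `vget_low_u8` is the "cast" meant here (it returns the low 64 bits of a `uint8x16_t` as a `uint8x8_t`). The function name `widening_add_self_16_neon` is hypothetical, and the scalar branch is there only so the snippet also builds on non-AArch64 hosts. Whether the compiler really emits a single `uaddl` plus `uaddl2` depends on the compiler version and optimization level; this has not been checked against gcc 5.4.0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Widen 16 pixels from src to 16 bits and add each to itself,
   writing 16 results to dst. */
void widening_add_self_16_neon(const uint8_t *src, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
#if defined(__aarch64__)
    uint8x16_t v  = vld1q_u8(src);        /* ld1 {v.16b}, [x0]          */
    uint8x8_t  lo = vget_low_u8(v);       /* low half as a 64-bit vector */
    uint16x8_t r0 = vaddl_u8(lo, lo);     /* intended to map to uaddl    */
    uint16x8_t r1 = vaddl_high_u8(v, v);  /* intended to map to uaddl2   */
    vst1q_u16(dst, r0);                   /* lanes 0..7                  */
    vst1q_u16(dst + 8, r1);               /* lanes 8..15                 */
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (size_t i = 0; i < 16; ++i)
        dst[i] = (uint16_t)(2u * src[i]);
#endif
}
```

Since the D register view is just the low 64 bits of the corresponding Q register on AArch64, the expectation is that `vget_low_u8` costs no extra instruction here, but inspecting the generated assembly (for example with `-O2 -S`) is the only way to be sure.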