I am trying to learn how to utilize NEON using gcc and inline assembly. While it is confusing and slow going, I making some progress (It's been 10 years since I last tried writing assembly). My simple program loads a (small) vector, saturation sums it, and stores it. The problem I am having is that I cannot seem to store the result in the place I want. When I use an unused array pointer (r) in my output list, I get an error "impossible constraint in asm". If I then create a second pointer to it (rptr), it assembles, but it re-uses an input register r2 which is a, effectively overwriting the input. (I know my arrays are 32 elements in size and that I'm only processing one element, I plan to try to loop, or try load more registers for parallel processing next)
void vecSum()
{
//two input arrays of 32 bit types, one output
int32_t a[32];
int32_t b[32];
int32_t r[32];
//initialize
for(int cnt = 0; cnt < 32; cnt++)
{
a[cnt] = 0x33333333;
b[cnt] = 0x11111111;
r[cnt] = 0;
}
void *rptr = r;
__asm__ volatile(
"vld1.32 {d0},[%[ina]]!\n" //load the neon register with our data at a, post increment the reg
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n" //store the answer
: [result]"=r" (rptr) /*r*/
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
for(int g=0; g < 32; g++)
{
printf("0x[%d]%x ",g,a[g]);
}
}
Objdump:
for(int cnt = 0; cnt < 32; cnt++)
780: e3530080 cmp r3, #128 ; 0x80
784: 1afffff7 bne 768 <_Z8vecSum32v+0x28>
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n"
: [result]"=r" (rptr)
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
788: f422078f vld1.32 {d0}, [r2]
78c: f421178d vld1.32 {d1}, [r1]!
790: f2200011 vqadd.s32 d0, d0, d1
794: f402078f vst1.32 {d0}, [r2]
In summary, if I try vst1.32 d0,[%[result]]
where result is the array pointer r, I get a compilation error. If I rptr ( another pointer to r) it comiles, but uses r2 (the array a) as the output.
Can anybody explain why I get the error outputting to r? And why the ptr to r is a?
=r
tells, that your code produces the valuerptr
, which doesn't seem to be true. The output of the assembler is in memory; the proper constraint AFAIK would be[result]"+r" (rptr)
(assuming that eventually you need to*rptr++ = xxx
) - Aki Suihkonen