NEON inline assembly - store query

Question

I am trying to learn how to use NEON with gcc and inline assembly. It is confusing and slow going, but I am making some progress (it's been 10 years since I last tried writing assembly). My simple program loads a (small) vector, adds it with saturation, and stores the result. The problem is that I cannot seem to store the result where I want it. When I use the otherwise unused array (r) in my output list, I get the error "impossible constraint in asm". If I instead create a second pointer to it (rptr), it assembles, but the store re-uses the input register r2, which holds a, so the input is overwritten. (I know my arrays have 32 elements and that I'm only processing the first d register's worth, i.e. two elements; I plan to try looping, or loading more registers for parallel processing, next.)

void vecSum()
{
    //two input arrays of 32 bit types, one output
    int32_t a[32];
    int32_t b[32];
    int32_t r[32];

    //initialize
    for(int cnt = 0; cnt < 32; cnt++)
    {
        a[cnt] = 0x33333333;
        b[cnt] = 0x11111111;
        r[cnt] = 0;
    }

    void *rptr = r;

    __asm__ volatile(
    "vld1.32 {d0},[%[ina]]!\n"  //load the neon register with our data at a, post increment the reg
    "vld1.32 {d1},[%[inb]]!\n"
    "vqadd.s32 d0,d1\n"        //perform the sat
    "vst1.32 d0,[%[result]]\n" //store the answer
    : [result]"=r" (rptr) /*r*/
    : [ina] "r" (a), [inb] "r" (b)
    : /*"d0", "d1", "d2"*/);

    for(int g=0; g < 32; g++)
    {
        printf("0x[%d]%x ",g,a[g]);
    }    

}

Objdump:

for(int cnt = 0; cnt < 32; cnt++)
 780:   e3530080    cmp r3, #128    ; 0x80
 784:   1afffff7    bne 768 <_Z8vecSum32v+0x28>
"vld1.32 {d1},[%[inb]]!\n"
"vqadd.s32 d0,d1\n" //perform the sat
"vst1.32 d0,[%[result]]\n"
: [result]"=r" (rptr)
: [ina] "r" (a), [inb] "r" (b)
: /*"d0", "d1", "d2"*/);
 788:   f422078f    vld1.32 {d0}, [r2]
 78c:   f421178d    vld1.32 {d1}, [r1]!
 790:   f2200011    vqadd.s32   d0, d0, d1
 794:   f402078f    vst1.32 {d0}, [r2]

In summary, if I try vst1.32 d0,[%[result]] where result is the array r, I get a compilation error. If I use rptr (another pointer to r) it compiles, but uses r2 (the array a) as the output.

Can anybody explain why I get the error when outputting to r? And why does the pointer to r end up in the same register as a?

The constraint =r says that your code produces the value of rptr, which doesn't seem to be true. The output of the assembly goes to memory; the proper constraint AFAIK would be [result]"+r" (rptr) (assuming that eventually you need to do *rptr++ = xxx) - Aki Suihkonen
Thank you, that seems to have the desired effect. Most of the examples deal with =r; indeed the (old) gcc how-to page I have doesn't list a '+' modifier. I've found the new one now. Any idea why +r works using (rptr) but not (r) directly? - ianhobo
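
As for the follow-up: an output operand has to be something the compiler can assign to, and an array such as r is not assignable, whereas the pointer variable rptr is. The "+r" form matters most once the post-increment (`!`) addressing is used, because the instruction then changes the pointer register. A minimal sketch of that looping case follows, assuming 32-bit ARM with NEON enabled (for example -mfpu=neon); the function name sat_add_n, the even-count requirement, and the scalar fallback for other targets are illustrative assumptions, not part of the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r[i] = saturate(a[i] + b[i]) for n elements.
 * Hypothetical helper; n is assumed to be even because one d register
 * holds two 32-bit lanes. */
void sat_add_n(const int32_t *a, const int32_t *b, int32_t *r, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
        __asm__ volatile(
            "vld1.32 {d0}, [%[ina]]!\n\t"   /* load 2 lanes, advance a */
            "vld1.32 {d1}, [%[inb]]!\n\t"   /* load 2 lanes, advance b */
            "vqadd.s32 d0, d0, d1\n\t"      /* saturating add */
            "vst1.32 {d0}, [%[res]]!\n\t"   /* store 2 lanes, advance r */
            /* "+r": each pointer is read AND updated by the asm */
            : [ina] "+r"(a), [inb] "+r"(b), [res] "+r"(r)
            :
            : "d0", "d1", "memory");
#else
        /* Scalar fallback so the sketch also builds on non-NEON targets. */
        for (int k = 0; k < 2; k++) {
            int64_t s = (int64_t)a[k] + (int64_t)b[k];
            if (s > INT32_MAX) s = INT32_MAX;
            if (s < INT32_MIN) s = INT32_MIN;
            r[k] = (int32_t)s;
        }
        a += 2; b += 2; r += 2;
#endif
    }
}
```

Called as sat_add_n(a, b, r, 32) with the arrays from the question, this should leave a and b untouched and fill r, since the function only advances its own copies of the pointers.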

Timothy Baldwin · Accepted Answer · 2015-07-29T16:44:39

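rptr is declared as an output when it should be an input, and "memory" is missing from the clobber list. Because "=r" marks rptr as write-only, the compiler does not have to load its value before the asm and is free to give it the same register as an input, which is why it ends up sharing r2 with a.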
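
A minimal sketch of that fix, assuming 32-bit ARM with NEON enabled: the destination pointer moves to the input list, and "memory" plus the NEON registers the code touches go in the clobber list. The post-increment `!` is dropped here because a plain "r" input must not be modified by the asm. The function name sat_add2 and the scalar fallback for other targets are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* r[0..1] = saturate(a[0..1] + b[0..1]); hypothetical helper name. */
void sat_add2(const int32_t *a, const int32_t *b, int32_t *r)
{
    assert(a && b && r);
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__ volatile(
        "vld1.32 {d0}, [%[ina]]\n\t"      /* load two lanes from a */
        "vld1.32 {d1}, [%[inb]]\n\t"      /* load two lanes from b */
        "vqadd.s32 d0, d0, d1\n\t"        /* saturating add */
        "vst1.32 {d0}, [%[result]]\n\t"   /* store two lanes to r */
        : /* no register outputs: the result goes to memory */
        : [ina] "r"(a), [inb] "r"(b), [result] "r"(r)
        : "d0", "d1", "memory");          /* registers and memory changed */
#else
    /* Scalar fallback so the sketch also builds on non-NEON targets. */
    for (int i = 0; i < 2; i++) {
        int64_t s = (int64_t)a[i] + (int64_t)b[i];
        if (s > INT32_MAX) s = INT32_MAX;
        if (s < INT32_MIN) s = INT32_MIN;
        r[i] = (int32_t)s;
    }
#endif
}
```

With all three pointers declared as inputs, the compiler has to keep them in distinct registers for the duration of the asm, so the store can no longer land on a.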

Alternatively you may put the arrays in structs and use the structs (rather than pointers) as arguments to the asm statement.
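
One possible reading of the struct suggestion, sketched under assumptions: the addresses are still passed in registers, while extra "m" operands on the structs tell the compiler exactly which memory is read and written, so the blanket "memory" clobber is not needed. The type vec2_t and the function sat_add_vec2 are hypothetical names, and the scalar fallback exists only so the sketch builds on non-NEON targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: two 32-bit lanes, the amount one d register holds. */
typedef struct { int32_t lane[2]; } vec2_t;

void sat_add_vec2(const vec2_t *a, const vec2_t *b, vec2_t *r)
{
    assert(a && b && r);
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__(
        "vld1.32 {d0}, [%[ina]]\n\t"
        "vld1.32 {d1}, [%[inb]]\n\t"
        "vqadd.s32 d0, d0, d1\n\t"
        "vst1.32 {d0}, [%[res]]\n\t"
        : "=m"(*r)                      /* the struct that is written */
        : [ina] "r"(a), [inb] "r"(b), [res] "r"(r),
          "m"(*a), "m"(*b)              /* the structs that are read */
        : "d0", "d1");
#else
    for (int i = 0; i < 2; i++) {
        int64_t s = (int64_t)a->lane[i] + (int64_t)b->lane[i];
        if (s > INT32_MAX) s = INT32_MAX;
        if (s < INT32_MIN) s = INT32_MIN;
        r->lane[i] = (int32_t)s;
    }
#endif
}
```

The memory operands are not referenced in the template; they are there only to describe the data dependencies, which can let the compiler optimize around the asm more freely than a "memory" clobber would.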
