`ldm/stm` in gcc inline ARM assembly

Question

I am trying to emit an ldm (resp. stm) instruction from inline assembly but have trouble expressing the operands (especially their order).

A trivial

void *ptr;
unsigned int a;
unsigned int b;

__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));

does not work because it might put a into r1 and b into r0:

ldm ip!, {r1, r0}

ldm expects registers in ascending order (as they are encoded in a bitfield), so I need a way to say that the register used for a is lower than that of b.

A trivial way is the fixed assignment of registers:

register unsigned int a asm("r0");
register unsigned int b asm("r1");

__asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));

But this removes a lot of flexibility and might make the generated code suboptimal.

Does gcc (4.8) support special constraints for ldm/stm? Or, are there better ways to solve this (e.g. some __builtin function)?

EDIT:

Because there are recommendations to use "higher level" constructs... The problem I want to solve is packing of 20 bits of a 32 bit word (e.g. input is 8 words, output is 5 words). Pseudo code is

asm("ldm  %[in]!,{ %[a],%[b],%[c],%[d] }" ...)
asm("ldm  %[in]!,{ %[e],%[f],%[g],%[h] }" ...) /* splitting the ldm generates better code;
                                                  gcc runs out of registers otherwise */
/* do some arithmetic on a - h */

asm volatile("stm  %[out]!,{ %[a],%[b],%[c],%[d],%[e] }" ...)

Speed matters here and ldm is 50% faster than ldr. The arithmetic is tricky, and because gcc generates much better code than I do ;) I would like to solve it with inline assembly that gives gcc some hints about optimized memory access.

Have you seen this? gcc.gnu.org/ml/gcc-help/2007-04/msg00092.html — auselen
@auselen thx for the link; it is exactly the problem I am describing. But post is from 2007 and perhaps something has been changed since then? — ensc
I bet not. "it would require at least a partial rewrite of gcc's register allocator." gcc.gnu.org/ml/gcc-help/2007-04/msg00109.html — auselen
Best would be to solve your problem at a higher level and be register agnostic. — auselen

artless noise · Accepted Answer · 2013-12-17T17:33:16

I have recommended the same solution in ARM memtest, i.e., explicitly assign the registers. The analysis on gcc-help is wrong: there is no need to rewrite GCC's register allocation. The only thing needed is a way to order the registers in an assembler specification.
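A minimal sketch of the explicit-assignment approach, wrapped in a helper so the fixed registers stay local to one small function. The name load_pair, the choice of r0/r1, and the portable fallback branch are assumptions made for illustration, not part of the original recommendation. The "m" input operand is an addition that tells gcc the asm reads the two words behind p.

```c
#include <stdint.h>

/* Hypothetical helper: load two consecutive words and return the advanced pointer. */
static inline const uint32_t *load_pair(const uint32_t *p,
                                        uint32_t *lo, uint32_t *hi)
{
#if defined(__arm__) && !defined(__thumb__)
    /* Pin the outputs so that the register list is in ascending order. */
    register uint32_t a __asm__("r0");
    register uint32_t b __asm__("r1");

    /* p is a read-write operand, so it cannot share a register with a or b. */
    __asm__("ldm %0!, {%1, %2}"
            : "+r"(p), "=r"(a), "=r"(b)
            : "m"(*(const uint32_t (*)[2])p));
    *lo = a;
    *hi = b;
#else
    /* Portable fallback so the sketch also builds on non-ARM hosts. */
    *lo = p[0];
    *hi = p[1];
    p += 2;
#endif
    return p;
}
```

Calling load_pair(buf, &lo, &hi) on a two-word buffer should leave the first word in lo, the second in hi, and return buf + 2. The cost is the one named in the question: the allocator has no freedom for those two values around the asm statement.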

That said, gcc itself accepts the following,

int main(void)
{
    void *ptr;
    register unsigned int a __asm__("r1");
    register unsigned int b __asm__("r0");

    __asm__("ldm %0!,{%1,%2}" : "+&r"(ptr), "=r"(a), "=r"(b));
    return 0;
}

However, it will not build, as the output contains an illegal ARM instruction, ldm r3!,{r1,r0}, with my gcc. A solution is to use the -S flag to stop after generating assembly and then run a script that orders the ldm/stm operands. Perl can easily do this with,

$reglist = join(',', sort(split(',', $reglist)));

Or any other way. Unfortunately, there doesn't appear to be any way to do this using assembler constraints. If we had access to the assigned register number, inline alternatives or conditional compilation could be used.

Probably the easiest solution is to use explicit register assignment, unless you are writing a vector library that needs to load/store multiple values and you want to give the compiler some freedom to generate better code. In that case, it is probably better to use structures, as the higher-level gcc optimizations will be able to detect unneeded operations (such as multiplication by one or addition of zero).
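As a rough illustration of the structure idea, the sketch below moves four words at a time through a plain struct and leaves instruction selection to the compiler. The names quad, load_quad and store_quad are hypothetical. Whether gcc actually emits ldm/stm for these copies depends on the version, the flags and what it knows about alignment, so treat that as something to check in the generated assembly rather than a guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical four-word bundle; the compiler chooses the registers. */
struct quad { uint32_t a, b, c, d; };

/* Read four words from *pp and advance the pointer. */
static inline struct quad load_quad(const uint32_t **pp)
{
    struct quad q;
    memcpy(&q, *pp, sizeof q);   /* block copy; may become a load-multiple */
    *pp += 4;
    return q;
}

/* Write four words to *pp and advance the pointer. */
static inline void store_quad(uint32_t **pp, struct quad q)
{
    memcpy(*pp, &q, sizeof q);   /* block copy; may become a store-multiple */
    *pp += 4;
}
```

Because the fields are ordinary C values, arithmetic on q.a through q.d stays visible to the optimizer, which is the point of preferring this over an asm statement.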

Edit:

Because there are recommendations to use "higher level" constructs... The problem I want to solve is packing of 20 bits of a 32 bit word (e.g. input is 8 words, output is 5 words).

This will probably give better results,

  u32 *ip, *op;
  u32 in, out, mask;
  int shift = 0;
  const u32 *op_end = op + 5;

  while(op != op_end) {
     in = *ip++;
     /* mask and accumulate... */
     if(shift >= 32) {
       *op++ = out;
       shift -=32;
     }
  }

The reasoning is that the ARM pipeline generally has several stages, with a separate load/store unit. The ALU (arithmetic) may proceed in parallel with the load and the store, so you can be working on the first word while later words are loading. In this case, you may also replace the values in place, which gives a cache benefit unless you need to reuse the 20-bit values. Once the code is in the cache, ldm/stm has little benefit if you stall on data. That will be your case.
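For concreteness, here is one way the loop could be filled in. It is a sketch under assumptions the thread does not state: each input word carries its 20-bit value in the low bits, values are packed starting from the least significant bit of the first output word, and a 64-bit accumulator is acceptable (on 32-bit ARM it occupies a register pair). The function name pack20 is hypothetical.

```c
#include <stdint.h>

/* Pack the low 20 bits of 8 input words into 5 output words (160 bits). */
static void pack20(const uint32_t *ip, uint32_t *op)
{
    uint64_t acc = 0;                 /* bit accumulator */
    unsigned bits = 0;                /* number of valid bits held in acc */
    const uint32_t *ip_end = ip + 8;

    while (ip != ip_end) {
        /* Mask to 20 bits and append above the bits already collected. */
        acc |= (uint64_t)(*ip++ & 0xFFFFFu) << bits;
        bits += 20;
        if (bits >= 32) {
            *op++ = (uint32_t)acc;    /* emit one full output word */
            acc >>= 32;
            bits -= 32;
        }
    }
}
```

With 8 inputs of 20 bits the accumulator never holds more than 48 bits, and the loop emits exactly 5 words. Whether this beats a hand-scheduled ldm/stm version is something to measure on the target.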

2nd Edit: The main job of a compiler is to avoid loading values from memory; i.e., register assignment is crucial. Generally, ldm/stm are most useful in memory transfer functions, e.g., a memory test, a memcpy() implementation, etc. If you are doing computation with the data, then the compiler may have better knowledge about pipeline scheduling. You probably need to either accept plain C code or move to complete assembler. Remember, the ldm has its first operands available to use immediately, while use of the ALU with subsequent registers can cause a stall waiting for the data to load. Similarly, the stm needs the calculations for its first register to be complete when it executes, but this is less critical.
