3 votes

I'm using the latest available version of ARM-packaged GCC:

arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release) Copyright (C) 2020 Free Software Foundation, Inc.

When I compile this code using "-mcpu=cortex-m0 -mthumb -Ofast":

#include <stdint.h>
/* plus the device header that defines ADC1, e.g. "stm32f10x.h" */

int main(void) {
    uint16_t num = (uint16_t) ADC1->DR;
    ADC1->DR = num / 7;
}

I would expect that the division would be accomplished by a multiplication and a shift, but instead this code is being generated:

 08000b5c <main>:
 8000b5c: b510 push {r4, lr}
 8000b5e: 4c05 ldr r4, [pc, #20] ; (8000b74 <main+0x18>)
 8000b60: 2107 movs r1, #7
 8000b62: 6c20 ldr r0, [r4, #64] ; 0x40
 8000b64: b280 uxth r0, r0
 8000b66: f7ff facf bl 8000108 <__udivsi3>
 8000b6a: b280 uxth r0, r0
 8000b6c: 6420 str r0, [r4, #64] ; 0x40
 8000b6e: 2000 movs r0, #0
 8000b70: bd10 pop {r4, pc}
 8000b72: 46c0 nop ; (mov r8, r8)
 8000b74: 40012400 .word 0x40012400

Using __udivsi3 instead of multiply and shift is terribly inefficient. Am I using the wrong flags, or missing something else, or is this a GCC bug?


3 Answers

3 votes

The Cortex-M0 lacks instructions to perform a 32x32->64-bit multiply. Because num is an unsigned 16-bit quantity, a hand-written multiply and shift such as (num*37449 + 16384u) >> 18 would yield a correct result in all cases, but, likely because a uint16_t is promoted to int before the multiply, gcc does not perform such optimizations.
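For illustration (a minimal sketch, not gcc output; the function name is mine), the kind of hand-written strength reduction described above looks like this, and it can be checked by brute force over all 65536 inputs:

#include <stdint.h>

/* Divide a 16-bit value by 7 with a multiply and a shift.  37449 is 2^18/7
   truncated, 16384 is the rounding term, and the 32-bit intermediate cannot
   overflow because 0xFFFF * 37449 + 16384 < 2^32. */
static inline uint16_t div7_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 37449u + 16384u) >> 18);
}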

From what I've observed, gcc does a generally poor job of optimizing for the Cortex-M0, failing to employ some straightforward optimizations which would be appropriate for that platform, but sometimes employing "optimizations" which aren't. Given something like

void test1(uint8_t *p)
{
    for (int i=0; i<32; i++)
        p[i] = (p[i]*9363) >> 16; // Divide by 7
}

gcc happens to generate okay code for the Cortex-M0 at -O2, but if the multiplication were replaced with an addition, the compiler would generate code which reloads the constant 9363 on every iteration of the loop. Even if the code were changed to:

void test2(uint16_t *p)
{
    register unsigned u9363 = 9363;
    for (int i=0; i<32; i++)
        p[i] = (p[i]+u9363) >> 16;
}

gcc would still keep the load of the constant inside the loop. Sometimes gcc's optimizations can also have unexpected behavioral consequences. For example, one might expect that on a platform like the Cortex-M0, invoking something like:

unsigned short test(register unsigned short *p)
{
    register unsigned short temp = *p;
    return temp - (temp >> 15);
}    

while an interrupt changes the contents of *p would yield behavior consistent with either the old value or the new value. The Standard doesn't require such treatment, but most implementations intended to be suitable for embedded programming tasks offer stronger guarantees than the Standard requires. If either the old or the new value would be equally acceptable, letting the compiler use whichever is more convenient may allow more efficient code than using volatile. As it happens, however, the "optimized" code from gcc replaces the two uses of temp with two separate loads of *p.
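If a single consistent read really is required, the conventional way to force it is a volatile-qualified access, at the efficiency cost mentioned above. A minimal sketch (the function name is mine):

/* Read *p exactly once through a volatile-qualified pointer, even though an
   interrupt may modify it, then operate only on the local copy. */
unsigned short test_volatile(volatile unsigned short *p)
{
    unsigned short temp = *p;      /* exactly one load of *p */
    return temp - (temp >> 15);
}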

If you're using gcc with the Cortex-M0 and are at all concerned about performance or the possibility of "astonishing" behaviors, get in the habit of inspecting the compiler's output. For some kinds of loop it may even be worth trying -O0: if code makes suitable use of the register keyword, its performance under -O0 can sometimes beat that of identical code compiled with -O2.

-1 votes

Expanding on supercat's answer.

Feed this:

unsigned short fun ( unsigned short x )
{
    return(x/7);
}

to something with a larger multiply:

00000000 <fun>:
   0:   e59f1010    ldr r1, [pc, #16]   ; 18 <fun+0x18>
   4:   e0832190    umull   r2, r3, r0, r1
   8:   e0400003    sub r0, r0, r3
   c:   e08300a0    add r0, r3, r0, lsr #1
  10:   e1a00120    lsr r0, r0, #2
  14:   e12fff1e    bx  lr
  18:   24924925    .word   0x24924925
  

1/7 in binary (long division):

     0.001001001001001
 111)1.000000
       111 
      ==== 
         1000
          111
          ===
            1
            
        
0.001001001001001001001001001001
0.0010 0100 1001 0010 0100 1001 001001
0x2492492492...
0x24924925>>32  (rounded up)
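Where that rounded-up constant comes from can be reproduced directly (a quick sketch that just prints the 32-bit fixed-point reciprocal of 7):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* ceil(2^32 / 7): the 1/7 fraction scaled up by 32 bits, rounded up */
    uint32_t magic = (uint32_t)((((uint64_t)1 << 32) + 6) / 7);
    printf("0x%08X\n", magic);   /* prints 0x24924925 */
    return 0;
}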

For this to work you need a 64-bit result: you take the top half and do some adjustments. For example:

7 * 0x24924925 = 0x100000003

and you take the top 32 bits (not completely this simple but for this value you can see it working).
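In C, the ARM-mode sequence shown above corresponds roughly to this (a sketch; the 64-bit intermediate is what umull provides, and the function name is mine):

#include <stdint.h>

/* Take the top 32 bits of the 64-bit product, then apply the same
   sub / add-with-shift / shift fix-up as the generated code above. */
unsigned int div7_u32(unsigned int x)
{
    unsigned int hi = (unsigned int)(((uint64_t)x * 0x24924925u) >> 32);  /* umull */
    unsigned int t  = hi + ((x - hi) >> 1);                               /* sub; add, lsr #1 */
    return t >> 2;                                                        /* lsr #2 */
}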

The all-Thumb (Cortex-M0) multiply is 32 bits = 32 bits * 32 bits, so the result would be 0x00000003, and that does not work.

So we take the top 16 bits, 0x2492, which we can round up to 0x2493 as supercat did, or leave as 0x2492.

Now we can use the 32x32 = 32 bit multiply:

0x2492 * 7 = 0x0FFFE
0x2493 * 7 = 0x10005

Let's run with the larger one:

0x100000000 / 0x2493 is greater than 0x10000, so a 16-bit operand times 0x2493 can never overflow 32 bits. That is fine.

but:

0x3335 * 0x2493 = 0x0750DB6F
0x3336 * 0x2493 = 0x07510002
0x3335 / 7 = 0x750
0x3336 / 7 = 0x750

So you can only get so far with that approach.

If we follow the model of the arm code:

#include <stdio.h>

int main(void)
{
    unsigned int ra,rb,rc,rd;
    for(ra=0;ra<0x10000;ra++)
    {
        rb=0x2493*ra;
        rd=rb>>16;          /* top 16 bits of the product */
        rb=ra-rd;
        rb=rd+(rb>>1);
        rb>>=2;
        rc=ra/7;
        printf("0x%X 0x%X 0x%X \n",ra,rb,rc);
        if(rb!=rc) break;
    }
    return 0;
}

Then it works from 0x0000 to 0xFFFF, so you could write the asm to do that (note it needs to be 0x2493 not 0x2492).

If you know the operand is not going above a certain value then you can use more bits of 1/7th to multiply against.
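For example (a sketch, function name mine): if the operand is known to fit in 8 bits, a 24-bit slice of 1/7 can be used and a plain multiply and shift is exact for 0..255:

#include <stdint.h>

/* 0x24924A is 2^24/7 rounded up; 255 * 0x24924A still fits in 32 bits,
   so no rounding term or fix-up is needed for 8-bit operands. */
static inline uint8_t div7_u8(uint8_t x)
{
    return (uint8_t)(((uint32_t)x * 0x24924Au) >> 24);
}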

In any case, when the compiler does not do this optimization for you, you can still do it yourself.

Now that I think about it, I have run into this before, and now it makes sense. I was on a full-sized ARM and called a routine compiled in ARM mode (the rest of the code was in Thumb mode) that was basically a switch statement: if the denominator is 1 then result = x/1, if the denominator is 2 then result = x/2, and so on. That avoided the gcc library function and generated the reciprocal multiplies (I had three or four different constants to divide by).
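The shape of that routine was roughly as follows (a from-memory sketch; the function name and the particular divisors are only for illustration):

/* Each case divides by a compile-time constant, so the compiler can emit a
   reciprocal multiply (in ARM mode, where the wide multiply exists) instead
   of calling the library divide routine. */
unsigned int div_by_small_const(unsigned int x, unsigned int d)
{
    switch (d) {
    case 1: return x / 1;
    case 2: return x / 2;
    case 3: return x / 3;
    case 7: return x / 7;
    default: return x / d;   /* anything else falls back to __udivsi3 or similar */
    }
}

Back on the Cortex-M0, here is a hand-written /7 helper following the model above: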

unsigned short udiv7 ( unsigned short x )
{
    unsigned int r0;
    unsigned int r3;
    
    r0=x;
    r3=0x2493*r0;
    r3>>=16;
    r0=r0-r3;
    r0=r3+(r0>>1);
    r0>>=2;
    return(r0);
}

Assuming I made no mistakes:

00000000 <udiv7>:
   0:   4b04        ldr r3, [pc, #16]   ; (14 <udiv7+0x14>)
   2:   4343        muls    r3, r0
   4:   0c1b        lsrs    r3, r3, #16
   6:   1ac0        subs    r0, r0, r3
   8:   0840        lsrs    r0, r0, #1
   a:   18c0        adds    r0, r0, r3
   c:   0883        lsrs    r3, r0, #2
   e:   b298        uxth    r0, r3
  10:   4770        bx  lr
  12:   46c0        nop         ; (mov r8, r8)
  14:   00002493    .word   0x00002493

That should be faster than a generic division library routine.

Edit

I think I see what supercat has done with the solution that works:

((i*37449 + 16384u) >> 18)

We have this as the 1/7th fraction:

0.001001001001001001001001001001

but we can only do a 32 = 32x32 bit multiply. The leading zeros give us some breathing room we might be able to take advantage of. So instead of 0x2492/0x2493 we can try:

1001001001001001
0x9249
0x9249*0xFFFF = 0x92486db7

And so far it won't overflow:

    rb=((ra*0x9249) >> 18);

By itself it fails at ra = 7: 7 * 0x9249 = 0x3FFFF, and 0x3FFFF >> 18 is zero, not 1.

So maybe

    rb=((ra*0x924A) >> 18);

that fails at:

    0xAAAD 0x1862 0x1861   (input, computed result, correct result)

So what about:

    rb=((ra*0x9249 + 0x8000) >> 18);

and that works.

What about supercat's?

    rb=((ra*0x9249 + 0x4000) >> 18);

and that runs clean for all values 0x0000 to 0xFFFF. Going one step smaller:

    rb=((ra*0x9249 + 0x2000) >> 18);

and that fails here:

0xE007 0x2000 0x2001 

So there are a couple of solutions that work.
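For completeness, here is a brute-force check of both working variants over all 16-bit inputs, along the lines of the loop shown earlier (a sketch):

#include <stdio.h>

int main(void)
{
    unsigned int ra, rc, errors = 0;
    for(ra=0;ra<0x10000;ra++)
    {
        rc = ra / 7;
        if(((ra*0x9249 + 0x8000) >> 18) != rc) { errors++; printf("+0x8000 fails at 0x%X\n", ra); }
        if(((ra*0x9249 + 0x4000) >> 18) != rc) { errors++; printf("+0x4000 fails at 0x%X\n", ra); }
    }
    printf("%u mismatches\n", errors);   /* expect 0 */
    return 0;
}

And the +0x4000 variant written out as a function: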

unsigned short udiv7 ( unsigned short x )
{
    unsigned int ret;
    ret=x;
    ret=((ret*0x9249 + 0x4000) >> 18);
    return(ret);
}

00000000 <udiv7>:
   0:   4b03        ldr r3, [pc, #12]   ; (10 <udiv7+0x10>)
   2:   4358        muls    r0, r3
   4:   2380        movs    r3, #128    ; 0x80
   6:   01db        lsls    r3, r3, #7
   8:   469c        mov ip, r3
   a:   4460        add r0, ip
   c:   0c80        lsrs    r0, r0, #18
   e:   4770        bx  lr
  10:   00009249    .word   0x00009249

Edit

As far as the "why" question goes, that is not really a Stack Overflow question; if you want to know why gcc does not do this, ask the gcc developers. All we can do here is speculate, and the speculation is that they either chose not to because of the number of instructions involved, or because their algorithm simply says: if there is no 64 = 32x32 bit multiply, do not bother.

Again, the "why" question is not a Stack Overflow question, so perhaps we should just close this question and delete all of the answers.

I found this to be incredibly educational (once you know/understand what was being said).

Another "why" question is why gcc did it the way it did when it could have done it the way supercat or I did.

-2 votes

The compiler can only rearrange integer expressions if it knows that the result will be correct for any input allowed by the language.

Because 7 is not a power of two, it is impossible to divide an arbitrary input by seven exactly using only a multiply and a shift of the same width (without a wider intermediate result).

If you know that it is possible for the range of inputs you intend to provide, then you have to do it yourself, using the multiply and shift operators.

Depending on the size of the input, you have to choose how much to shift so that the output is correct (or at least good enough for your application) and so that the intermediate result does not overflow. The compiler has no way of knowing what is accurate enough for your application, or what your maximum input will be; if it had to allow any input up to the maximum of the type, the multiplication would overflow.
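To put a number on that using the constant from the other answers (a quick sketch): the 32-bit intermediate x * 0x9249 only stays in range for operands up to 0x1C000, which is plenty for a 16-bit input but nowhere near the maximum of a 32-bit type:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Largest operand for which x * 0x9249 still fits in 32 bits */
    uint32_t limit = 0xFFFFFFFFu / 0x9249u;
    printf("x * 0x9249 fits in 32 bits for x <= 0x%X\n", limit);   /* 0x1C000 */
    return 0;
}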

In general, GCC will only turn a division into a shift when the divisor is a power of two.