Expanding on supercat's answer.
Feed this:
unsigned short fun ( unsigned short x )
{
return(x/7);
}
to something with a larger multiply:
00000000 <fun>:
0: e59f1010 ldr r1, [pc, #16] ; 18 <fun+0x18>
4: e0832190 umull r2, r3, r0, r1
8: e0400003 sub r0, r0, r3
c: e08300a0 add r0, r3, r0, lsr #1
10: e1a00120 lsr r0, r0, #2
14: e12fff1e bx lr
18: 24924925 .word 0x24924925
1/7 in binary (long division):
0.001001001001001
111)1.000000
111
====
1000
111
===
1
0.001001001001001001001001001001
0.0010 0100 1001 0010 0100 1001 001001
0x2492492492...
0x24924925>>32 (rounded up)
For this to work you need a 64 bit result, you take the top half and do some adjustments, so for example:
7 * 0x24924925 = 0x100000003
and you take the top 32 bits (not completely this simple but for this value you can see it working).
The all thumbs variant multiply is 32 bits = 32 bits * 32 bits, so the result would be 0x00000003 and that does not work.
So 0x24924 which we can make 0x2493 as supercat did or 0x2492.
Now we can use the 32x32 = 32 bit multiply:
0x2492 * 7 = 0x0FFFE
0x2493 * 7 = 0x10005
Let's run with the one larger:
0x100000000/0x2493 = a number greater than 65536. so that is fine.
but:
0x3335 * 0x2493 = 0x0750DB6F
0x3336 * 0x2493 = 0x07510002
0x3335 / 7 = 0x750
0x3336 / 7 = 0x750
So you can only get so far with that approach.
If we follow the model of the arm code:
for(ra=0;ra<0x10000;ra++)
{
rb=0x2493*ra;
rd=rb>>16;
rb=ra-rd;
rb=rd+(rb>>1);
rb>>=2;
rc=ra/7;
printf("0x%X 0x%X 0x%X \n",ra,rb,rc);
if(rb!=rc) break;
}
Then it works from 0x0000 to 0xFFFF, so you could write the asm to do that (note it needs to be 0x2493 not 0x2492).
If you know the operand is not going above a certain value then you can use more bits of 1/7th to multiply against.
In any case when the compiler does not do this optimization for you then you might still have a chance yourself.
Now that I think about it I ran into this before, and now it makes sense. But I was on a full sized arm and I called a routine I compiled in arm mode (the other code was in thumb mode), and had a switch statement basically if denominator = 1 then result = x/1; if denominator = 2 then result = x/2 and so on. And then it avoided the gcclib function and generated the 1/x multiplies. (I had like 3 or 4 different constants to divide by):
unsigned short udiv7 ( unsigned short x )
{
unsigned int r0;
unsigned int r3;
r0=x;
r3=0x2493*r0;
r3>>=16;
r0=r0-r3;
r0=r3+(r0>>1);
r0>>=2;
return(r0);
}
Assuming I made no mistakes:
00000000 <udiv7>:
0: 4b04 ldr r3, [pc, #16] ; (14 <udiv7+0x14>)
2: 4343 muls r3, r0
4: 0c1b lsrs r3, r3, #16
6: 1ac0 subs r0, r0, r3
8: 0840 lsrs r0, r0, #1
a: 18c0 adds r0, r0, r3
c: 0883 lsrs r3, r0, #2
e: b298 uxth r0, r3
10: 4770 bx lr
12: 46c0 nop ; (mov r8, r8)
14: 00002493 .word 0x00002493
That should be faster than a generic division library routine.
Edit
I think I see what supercat has done with the solution that works:
((i*37449 + 16384u) >> 18)
We have this as the 1/7th fraction:
0.001001001001001001001001001001
but we can only do a 32 = 32x32 bit multiply. The leading zeros give us some breathing room we might be able to take advantage of. So instead of 0x2492/0x2493 we can try:
1001001001001001
0x9249
0x9249*0xFFFF = 0x92486db7
And so far it won't overflow:
rb=((ra*0x9249) >> 18);
by itself it fails at 7 * 0x9249 = 0x3FFFF, 0x3FFFF>>18 is zero not 1.
So maybe
rb=((ra*0x924A) >> 18);
that fails at:
0xAAAD 0x1862 0x1861
So what about:
rb=((ra*0x9249 + 0x8000) >> 18);
and that works.
What about supercat's?
rb=((ra*0x9249 + 0x4000) >> 18);
and that runs clean for all values 0x0000 to 0xFFFF:
rb=((ra*0x9249 + 0x2000) >> 18);
and that fails here:
0xE007 0x2000 0x2001
So there are a couple of solutions that work.
unsigned short udiv7 ( unsigned short x )
{
unsigned int ret;
ret=x;
ret=((ret*0x9249 + 0x4000) >> 18);
return(ret);
}
00000000 <udiv7>:
0: 4b03 ldr r3, [pc, #12] ; (10 <udiv7+0x10>)
2: 4358 muls r0, r3
4: 2380 movs r3, #128 ; 0x80
6: 01db lsls r3, r3, #7
8: 469c mov ip, r3
a: 4460 add r0, ip
c: 0c80 lsrs r0, r0, #18
e: 4770 bx lr
10: 00009249 .word 0x00009249
Edit
As far as the "why" question goes, that is not a Stack Overflow question; if you want to know why gcc doesn't do this, ask the authors of that code. All we can do is speculate here and the speculation is they may either have chosen not to because of the number of instructions or they may have chosen not to because they have an algorithm that states because this is not a 64 = 32x32 bit multiply then do not bother.
Again the why question is not a Stack Overflow question, so perhaps we should just close this question and delete all of the answers.
I found this to be incredibly educational (once you know/understand what was being said).
Another WHY? question is why did gcc do it the way they did it when they could have done it the way supercat or I did it?