4
votes

Hi I'm implementing some fixed point math stuff for embedded systems and I'm trying to do the multiplication of two 16.16 fixed point numbers without creating a 64bit temporary. So far here is the code I came up with that generates the least instructions.

int multiply(int x, int y){
    int result;
    long long temp = x;
    temp *= y;
    temp >>= 16;
    result = temp;
    return result;
}

the problem with this code is that it uses a temporary 64 bit integer which seem to generate bad assembly code. I'm trying to make a system that uses two 32 bit integers instead of a 64 bit one. Anyone know how to do this?

1
Did you see a 64-bit temporary in the disassembly dump of the output, or just in your head? Just because the source code has a variable of type long long does not mean there's actually any "64-bit temporary". On any decent 32-bit arch, a 32x32 multiply automatically generates a 64-bit result, usually separated into two 32-bit registers for your (or the compiler's) convenience. Manipulating these is much more efficient than breaking it down into 4 16x16 multiplies to avoid the "64-bit temporary". - R.. GitHub STOP HELPING ICE
@R.. yeah I looked into the assembler dump. it uses two halves in the EAX and EDX register but it's implementation is not optimal - PgrAm
It's surely a lot less suboptimal than performing four MULs... - R.. GitHub STOP HELPING ICE
@R.. the idea is that now that I know how to do it in C I can implement it in hand optimized assembly - PgrAm
most modern x86 compilers will automatically use the upper 32 bits of the result if you specify the optimization level high enough - phuclv

1 Answers

6
votes

Think of your numbers as each composed of two large "digits."

  A.B
x C.D

The "base" of the digits is the 2^bit_width, i.e., 2^16, or 65536.

So, the product is

D*B       + D*A*65536 + C*B*65536 + C*A*65536*65536

However, to get the product shifted right by 16, you need to divide all these terms by 65536, so

D*B/65536 + D*A       + C*B        + C*A*65536

In C:

uint16_t a = x >> 16;
uint16_t b = x & 0xffff;
uint16_t c = y >> 16;
uint16_t d = y & 0xffff;

return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16);

The signed version is a bit more complicated; it is often easiest to perform the arithmetic on the absolute values of x and y and then fix the sign (unless you overflow, which you can check for rather tediously).