Signed Multiplication of 1024-bit 2's complement numbers using 32-bit chunks

Question

So I have the following struct definition for my 1024-bit number (I want to use 2's complement representation here and I am on a 32-bit system):

typedef struct int1024
{
    int32_t num[32]; //should I use uint32_t?
} int1024;

Basically an array that holds the segments of the number.

For add since, signed and unsigned addition are the same. I can simply use the instructions add and adc to carry out the extended operation for my bigger array.

Now for multiplication. I was able to get an unsigned multiplication to work using a combination of imul (using the upper and lower results) and adc using the classic O(n^2) multiplication algorithm. However I need signed multiplication for my results. I know people say just to use the signed magnitude version. i.e. take the 2's complement when negative and at the end apply 2's complement if needed. But all the extra noting and adding 1 will be really expensive, since I need to do a lot of multiplications. Is there a technique for doing large signed multiplication for number representations like this. (I don't really want any branching or recursive calls here). I guess one of my main problems is how to deal with carry and the high part of the multiplication of negative numbers. (just an idea, maybe I could use a combination of signed and unsigned multiplication, but I don't know where to begin). If you don't have time for 1024 bits, a 128 bit answer will suffice (I will just generalize it).

You would want to use an unsigned integer type for the array elements. Note that in an NxN->N integer multiplication (argument width = result width) it does not matter whether the multiplication is a signed or unsigned operation: the least significant N bits of the result are the same. Whether the multiply is signed or unsigned matters for the most significant N bits of a wide (NxN->2N) multiply, or the equivalent multiply-high operation. In that case the most significant bits of the signed multiply result can easily be generated from the most significant bits of the unsigned multiply. — njuffa
Are you doing this as an exercise? The GMP library does this and much more. See: gmplib.org BTW, what you call num they use unsigned long long for (if machine is natively 64 bits) And, each number maintains a count of cells, so they handle variable length numbers and grow as needed [which can speed things up as it is slow to do 32 multiplies if (e.g.) each number only uses a few number of bits] — Craig Estey
The negation is relatively cheap compared with the 'longhand' multiplication. My technique for big arithmetic is to work with 32-bit arrays and 64-bit intermediate values. — Weather Vane
@WeatherVane On ARM I used UMLAL (Unsigned Multiply and Accumulate Long) that does 32x32 + 32 -> 64, very handy to propagate the Carry. Is there anything similar in x86 ISA? — tum_
@tum_: x86's has always had widening multiply, but still doesn't have accumulate. (felixcloutier.com/x86/mul, up to 64x64 => 128-bit in x86-64 code.) Intel has a whitepaper comparing that against BMI2 MULX (widening mul that doesn't touch FLAGS, so it doesn't disturb ADC carry chains, and fewer implicit operands, only RDX input), and against MULX + ADOX/ADCX (Broadwell) to allow 2 independent carry chains (in CF and in OF): intel.com/content/dam/www/public/us/en/documents/white-papers/… for 512x64-bit or 512x512 — Peter Cordes

Peter Cordes Peter Cordes · Accepted Answer · 2021-03-17T04:04:14

Note that in a 1024-bit integer only the very top bit, bit 1023, is the sign bit and has a place-value of -(2**1023). All the other bits have their normal +(2**pos) place value. i.e. the lower "limbs" all need to be treated as unsigned when you do widening multiplies on them.

You have one sign bit and 1023 lower bits. Not one sign bit per limb.

Also, the difference between signed and unsigned multiply is only in the high half of a widening multiply (N x N => 2N bits). That's why x86 has separate instructions for imul r/m64 and mul r/m64 to do a full-multiply into RDX:RAX (https://www.felixcloutier.com/x86/imul vs. https://www.felixcloutier.com/x86/mul). But for non-widening there's only imul r, r/m and imul r, r/m, imm which compilers use for both unsigned and signed (and so should humans).

Since you want a 1024x1024 => 1024-bit product which discards the upper 1024 bits, you can and should just make your whole thing unsigned.

When you're writing in asm, you can use whatever chunk size you can, e.g. 64-bit in 64-bit mode, not limiting yourself to the C chunk size.

See https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf for how to use BMI2 mulx (leaves FLAGS untouched so doesn't disturb adc chains, and only has RDX as an implicit input, with other source and both outputs being explicit, saving on MOV instructions). And optionally also Broadwell ADOX / ADCX to run two dep chains in parallel to get more ILP for medium sized BigInteger stuff like 512x512-bit or 512x64-bit (which they use as an example).

Signed Multiplication of 1024-bit 2's complement numbers using 32-bit chunks

1 Answers