For my BigIntegers, in the PUREPASCAL implementation (i.e. no assembler allowed), I must multiply two UInt32
to get an UInt64
result.
The usual way to do this is to widen at least one of the operands, so you get a 64 bit multiplication:
Res := UInt64(A) * B;
where Res
is UInt64
and A
and B
are UInt32
.
But, In Win32, this produces a rather unwieldy piece of machine code:
MulTest.dpr.431: Res := UInt64(A) * B;
004DB463 8B45F8 mov eax,[ebp-$08] // load A
004DB466 33D2 xor edx,edx // make it UInt64
004DB468 52 push edx // push A
004DB469 50 push eax
004DB46A 8B45FC mov eax,[ebp-$04] // load B
004DB46D 33D2 xor edx,edx // make it UInt64
004DB46F E87C0AF3FF call @_llmul // 64 bit multiplication
004DB474 8945E8 mov [ebp-$18],eax // store 64 bit result
004DB477 8955EC mov [ebp-$14],edx
Now, if you just do:
Res := A * B;
you unfortunately get a 32 bit intermediate result (the top 32 bits of the actual result are simply zeroed out):
MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC mov eax,[ebp-$04]
004DB4C0 F76DF8 imul dword ptr [ebp-$08]
004DB4C3 33D2 xor edx,edx // zero out top 32 bits
004DB4C5 8945E8 mov [ebp-$18],eax
004DB4C8 8955EC mov [ebp-$14],edx
Now, if the line xor edx,edx
were not there, the result would be exactly what I need. This would be more than twice as fast (i.e. take less than half the time) as the widened version using the UInt64
cast.
Question: Does anyone know if there is a pseudo-function or trick or cast that does not discard the top 32 bits of the 64 bit result? I know how to do this in assembler, but this must be PUREPASCAL (it should work on other platforms too).
I managed to make 32 bit additions in PUREPASCAL much faster by accessing the array of 32 bit unisgned integers that makes up a BigInteger as an array of unsigned 16 bit integers and adding these instead. So I also tried multiplying using 16 bit intermediate results:
// Too slow: in a test, 2973 ms for Mul32(A, B) vs 1432 ms for UInt64(A) * B.
function MulU32ToU64(L, R: UInt32): UInt64; inline;
var
L0R0, L0R1, L1R0, L1R1, Sum: UInt32;
type
TUInt64 = packed record
case Byte of
0: (L0, L1, L2, L3: UInt16);
1: (I0, I1: UInt32);
end;
TUInt32 = packed record
Lo, Hi: Word;
end;
begin
L0R0 := TUInt32(L).Lo * TUInt32(R).Lo;
L0R1 := TUInt32(L).Lo * TUInt32(R).Hi;
L1R0 := TUInt32(L).Hi * TUInt32(R).Lo;
L1R1 := TUInt32(L).Hi * TUInt32(R).Hi;
TUInt64(Result).L0 := TUInt32(L0R0).Lo;
Sum := UInt32(TUInt32(L0R0).Hi) + TUInt32(L1R0).Lo + TUInt32(L0R1).Lo;
TUInt64(Result).L1 := TUInt32(Sum).Lo;
Sum := UInt32(TUInt32(Sum).Hi) + TUInt32(L1R0).Hi + TUInt32(L0R1).Hi + L1R1;
TUInt64(Result).I1 := Sum;
end;
It gives me the correct result, but is more than twice as slow as UInt64(A) * B
. That is no surprise, as it does 4 UInt32 multiplications and a lot of additions, which makes it slower than the code using System.__llmul
.
Update
As @J... pointed out, Delphi generally uses IMUL
, which does a signed multiplication. So the multiplication of e.g. $00000002
and $FFFFFFFF
results in EAX = $FFFFFFFE
and EDX = $FFFFFFFF
(in other words, an Int64
with value -2
), while I would need EAX = $FFFFFFFE
(the same), but EDX = $00000001
(together a UInt64
with value $00000001FFFFFFFE
). So it is right that the top 32 bits are discarded, and there seems to be no way to coerce Delphi into using MUL
and keeping the top 32 bits of the result of that.
UInt64(A) * B
is almost as fast as a plainA * B
). My problem is with 32 bit code, like Win32 or Mac. I didn't try on ARM platforms. – Rudy Velthuis