Multiplying two UInt32 to obtain a UInt64 without widening

Question

For my BigIntegers, in the PUREPASCAL implementation (i.e. no assembler allowed), I must multiply two UInt32 to get an UInt64 result.

The usual way to do this is to widen at least one of the operands, so you get a 64 bit multiplication:

Res := UInt64(A) * B;

where Res is UInt64 and A and B are UInt32.

But, In Win32, this produces a rather unwieldy piece of machine code:

MulTest.dpr.431: Res := UInt64(A) * B;
004DB463 8B45F8           mov eax,[ebp-$08]  // load A 
004DB466 33D2             xor edx,edx        // make it UInt64
004DB468 52               push edx           // push A
004DB469 50               push eax
004DB46A 8B45FC           mov eax,[ebp-$04]  // load B
004DB46D 33D2             xor edx,edx        // make it UInt64 
004DB46F E87C0AF3FF       call @_llmul       // 64 bit multiplication
004DB474 8945E8           mov [ebp-$18],eax  // store 64 bit result
004DB477 8955EC           mov [ebp-$14],edx

Now, if you just do:

Res := A * B;

you unfortunately get a 32 bit intermediate result (the top 32 bits of the actual result are simply zeroed out):

MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC           mov eax,[ebp-$04]
004DB4C0 F76DF8           imul dword ptr [ebp-$08]
004DB4C3 33D2             xor edx,edx              // zero out top 32 bits
004DB4C5 8945E8           mov [ebp-$18],eax
004DB4C8 8955EC           mov [ebp-$14],edx

Now, if the line xor edx,edx were not there, the result would be exactly what I need. This would be more than twice as fast (i.e. take less than half the time) as the widened version using the UInt64 cast.

Question: Does anyone know if there is a pseudo-function or trick or cast that does not discard the top 32 bits of the 64 bit result? I know how to do this in assembler, but this must be PUREPASCAL (it should work on other platforms too).

I managed to make 32 bit additions in PUREPASCAL much faster by accessing the array of 32 bit unisgned integers that makes up a BigInteger as an array of unsigned 16 bit integers and adding these instead. So I also tried multiplying using 16 bit intermediate results:

// Too slow: in a test, 2973 ms for Mul32(A, B) vs 1432 ms for UInt64(A) * B.
function MulU32ToU64(L, R: UInt32): UInt64; inline;
var
  L0R0, L0R1, L1R0, L1R1, Sum: UInt32;
type
  TUInt64 = packed record
    case Byte of
      0: (L0, L1, L2, L3: UInt16);
      1: (I0, I1: UInt32);
  end;
  TUInt32 = packed record
    Lo, Hi: Word;
  end;
begin
  L0R0 := TUInt32(L).Lo * TUInt32(R).Lo;
  L0R1 := TUInt32(L).Lo * TUInt32(R).Hi;
  L1R0 := TUInt32(L).Hi * TUInt32(R).Lo;
  L1R1 := TUInt32(L).Hi * TUInt32(R).Hi;
  TUInt64(Result).L0 := TUInt32(L0R0).Lo;
  Sum := UInt32(TUInt32(L0R0).Hi) + TUInt32(L1R0).Lo + TUInt32(L0R1).Lo;
  TUInt64(Result).L1 := TUInt32(Sum).Lo;
  Sum := UInt32(TUInt32(Sum).Hi) + TUInt32(L1R0).Hi + TUInt32(L0R1).Hi + L1R1;
  TUInt64(Result).I1 := Sum;
end;

It gives me the correct result, but is more than twice as slow as UInt64(A) * B. That is no surprise, as it does 4 UInt32 multiplications and a lot of additions, which makes it slower than the code using System.__llmul.

Update

As @J... pointed out, Delphi generally uses IMUL, which does a signed multiplication. So the multiplication of e.g. $00000002 and $FFFFFFFF results in EAX = $FFFFFFFE and EDX = $FFFFFFFF (in other words, an Int64 with value -2), while I would need EAX = $FFFFFFFE (the same), but EDX = $00000001 (together a UInt64 with value $00000001FFFFFFFE). So it is right that the top 32 bits are discarded, and there seems to be no way to coerce Delphi into using MUL and keeping the top 32 bits of the result of that.

Have you checked the code generation for the other platforms? If the problem only exists in Win32 I would probably go for an IFDEF with assembler for that case. — Uwe Raabe
@UweRaabe: No, I haven't checked the code generated for other platforms. On 64 bit platforms, the code is probably quite performant (on Win64, UInt64(A) * B is almost as fast as a plain A * B). My problem is with 32 bit code, like Win32 or Mac. I didn't try on ARM platforms. — Rudy Velthuis
@UweRaabe: Sure, I could use assembler on Win32, but that would violate my own rule that the code should be compilable as PUREPASCAL, and use no assembler. — Rudy Velthuis
FWIW, I thought I had found Int32x32To64(), but these turn out to be macros in winnt.h, and internally, they do exactly what I do (cast to UInt64), so no gain there. — Rudy Velthuis

J... J... · Accepted Answer · 2017-08-26T11:03:10

MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC           mov eax,[ebp-$04]
004DB4C0 F76DF8           imul dword ptr [ebp-$08]
004DB4C3 33D2             xor edx,edx              // zero out top 32 bits
004DB4C5 8945E8           mov [ebp-$18],eax
004DB4C8 8955EC           mov [ebp-$14],edx

Now, if the line xor edx,edx were not there, the result would be exactly what I need.

No, it isn't what you want at all. That's a signed multiply and the result is nonsense if you want an unsigned result. Make A:=$FFFFFFFF and B:=2 - the result of imul is EAX = FFFFFFFE and EDX = FFFFFFFF. This opcode is emitted even with two unsigned operands. You want the mul instruction, not imul. I don't think the delphi compiler will ever emit mul from pure pascal. From the documentation on *(emphasis mine)

The value of x / y is of type Extended, regardless of the types of x and y. For other arithmetic operators, the result is of type Extended whenever at least one operand is a real; otherwise, the result is of type Int64 when at least one operand is of type Int64; otherwise, the result is of type Integer.

Integer - signed. Given how dependent this is on the idiosyncracies of the architecture, and given the deficiencies of the delphi compilers, I think the only performant solution here is going to be target-dependent assembly.

function UMul3264(x, y : UInt32) : UInt64;
asm
  mul eax, edx
end;

Multiplying two UInt32 to obtain a UInt64 without widening

Update

2 Answers