12
votes

For my BigIntegers, in the PUREPASCAL implementation (i.e. no assembler allowed), I must multiply two UInt32 to get an UInt64 result.

The usual way to do this is to widen at least one of the operands, so you get a 64 bit multiplication:

Res := UInt64(A) * B;

where Res is UInt64 and A and B are UInt32.

But, In Win32, this produces a rather unwieldy piece of machine code:

MulTest.dpr.431: Res := UInt64(A) * B;
004DB463 8B45F8           mov eax,[ebp-$08]  // load A 
004DB466 33D2             xor edx,edx        // make it UInt64
004DB468 52               push edx           // push A
004DB469 50               push eax
004DB46A 8B45FC           mov eax,[ebp-$04]  // load B
004DB46D 33D2             xor edx,edx        // make it UInt64 
004DB46F E87C0AF3FF       call @_llmul       // 64 bit multiplication
004DB474 8945E8           mov [ebp-$18],eax  // store 64 bit result
004DB477 8955EC           mov [ebp-$14],edx

Now, if you just do:

Res := A * B;

you unfortunately get a 32 bit intermediate result (the top 32 bits of the actual result are simply zeroed out):

MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC           mov eax,[ebp-$04]
004DB4C0 F76DF8           imul dword ptr [ebp-$08]
004DB4C3 33D2             xor edx,edx              // zero out top 32 bits
004DB4C5 8945E8           mov [ebp-$18],eax
004DB4C8 8955EC           mov [ebp-$14],edx

Now, if the line xor edx,edx were not there, the result would be exactly what I need. This would be more than twice as fast (i.e. take less than half the time) as the widened version using the UInt64 cast.

Question: Does anyone know if there is a pseudo-function or trick or cast that does not discard the top 32 bits of the 64 bit result? I know how to do this in assembler, but this must be PUREPASCAL (it should work on other platforms too).

I managed to make 32 bit additions in PUREPASCAL much faster by accessing the array of 32 bit unisgned integers that makes up a BigInteger as an array of unsigned 16 bit integers and adding these instead. So I also tried multiplying using 16 bit intermediate results:

// Too slow: in a test, 2973 ms for Mul32(A, B) vs 1432 ms for UInt64(A) * B.
function MulU32ToU64(L, R: UInt32): UInt64; inline;
var
  L0R0, L0R1, L1R0, L1R1, Sum: UInt32;
type
  TUInt64 = packed record
    case Byte of
      0: (L0, L1, L2, L3: UInt16);
      1: (I0, I1: UInt32);
  end;
  TUInt32 = packed record
    Lo, Hi: Word;
  end;
begin
  L0R0 := TUInt32(L).Lo * TUInt32(R).Lo;
  L0R1 := TUInt32(L).Lo * TUInt32(R).Hi;
  L1R0 := TUInt32(L).Hi * TUInt32(R).Lo;
  L1R1 := TUInt32(L).Hi * TUInt32(R).Hi;
  TUInt64(Result).L0 := TUInt32(L0R0).Lo;
  Sum := UInt32(TUInt32(L0R0).Hi) + TUInt32(L1R0).Lo + TUInt32(L0R1).Lo;
  TUInt64(Result).L1 := TUInt32(Sum).Lo;
  Sum := UInt32(TUInt32(Sum).Hi) + TUInt32(L1R0).Hi + TUInt32(L0R1).Hi + L1R1;
  TUInt64(Result).I1 := Sum;
end;

It gives me the correct result, but is more than twice as slow as UInt64(A) * B. That is no surprise, as it does 4 UInt32 multiplications and a lot of additions, which makes it slower than the code using System.__llmul.

Update

As @J... pointed out, Delphi generally uses IMUL, which does a signed multiplication. So the multiplication of e.g. $00000002 and $FFFFFFFF results in EAX = $FFFFFFFE and EDX = $FFFFFFFF (in other words, an Int64 with value -2), while I would need EAX = $FFFFFFFE (the same), but EDX = $00000001 (together a UInt64 with value $00000001FFFFFFFE). So it is right that the top 32 bits are discarded, and there seems to be no way to coerce Delphi into using MUL and keeping the top 32 bits of the result of that.

2
Have you checked the code generation for the other platforms? If the problem only exists in Win32 I would probably go for an IFDEF with assembler for that case.Uwe Raabe
@UweRaabe: No, I haven't checked the code generated for other platforms. On 64 bit platforms, the code is probably quite performant (on Win64, UInt64(A) * B is almost as fast as a plain A * B). My problem is with 32 bit code, like Win32 or Mac. I didn't try on ARM platforms.Rudy Velthuis
@UweRaabe: Sure, I could use assembler on Win32, but that would violate my own rule that the code should be compilable as PUREPASCAL, and use no assembler.Rudy Velthuis
FWIW, I thought I had found Int32x32To64(), but these turn out to be macros in winnt.h, and internally, they do exactly what I do (cast to UInt64), so no gain there.Rudy Velthuis

2 Answers

6
votes
MulTest.dpr.435: Res := A * B;
004DB4BD 8B45FC           mov eax,[ebp-$04]
004DB4C0 F76DF8           imul dword ptr [ebp-$08]
004DB4C3 33D2             xor edx,edx              // zero out top 32 bits
004DB4C5 8945E8           mov [ebp-$18],eax
004DB4C8 8955EC           mov [ebp-$14],edx

Now, if the line xor edx,edx were not there, the result would be exactly what I need.

No, it isn't what you want at all. That's a signed multiply and the result is nonsense if you want an unsigned result. Make A:=$FFFFFFFF and B:=2 - the result of imul is EAX = FFFFFFFE and EDX = FFFFFFFF. This opcode is emitted even with two unsigned operands. You want the mul instruction, not imul. I don't think the delphi compiler will ever emit mul from pure pascal. From the documentation on *(emphasis mine)

The value of x / y is of type Extended, regardless of the types of x and y. For other arithmetic operators, the result is of type Extended whenever at least one operand is a real; otherwise, the result is of type Int64 when at least one operand is of type Int64; otherwise, the result is of type Integer.

Integer - signed. Given how dependent this is on the idiosyncracies of the architecture, and given the deficiencies of the delphi compilers, I think the only performant solution here is going to be target-dependent assembly.

function UMul3264(x, y : UInt32) : UInt64;
asm
  mul eax, edx
end;
0
votes

There is a Windows macro UInt32x32To64(a, b) to multiply two unsigned 32-bit values and get a 64-bit result.

If you need a pure pascal, you have to assign both of your 32-bit unsigned values to 64-bit unsigned values and then multiply them.

function UInt32x32To64(x, y: UInt32): UInt64;
var
  xl, yl: UInt64;
begin
  xl := x;
  yl := y;
  Result := xl * yl;
end;

Here is a code sample that checks this function. This code does also have an assembly function just for comparison. You don't need it because you have PurePascal, but it is implemented very efficiently - just one mul instruction. It is a special form of mul which takes just one argument, and the other comes from the eax register. The resulting 64-bit value is stored in edx:eax. So it is implemented more efficiently than casting 32-bit values to 64-bit and then multiplyng them, since Delphi calls __llmul for that from the System.pas, which does 3 mul instructions, each of them is costly.

program TestMultiply;
{$APPTYPE CONSOLE}

function UInt32x32To64_asm(x, y: UInt32): UInt64; assembler;
asm
  {$IFDEF WIN32}
  mul edx
  {$ELSE}
  mov eax, ecx
  mul edx
  shl rdx, 32
  or  rax, rdx
  {$ENDIF}
end;

function UInt32x32To64_purepascal(x, y: UInt32): UInt64;
var
  xl, yl: UInt64;
begin
  xl := x;
  yl := y;
  Result := xl * yl;
end;

var
  aa, bb: UInt32;
  cc, cc_test: UInt64;
begin
  aa := 2147483647;
  bb := 2148736590;
  cc_test := 4614376688735543730;
  cc := UInt32x32To64_asm(aa, bb);
  if cc <> cc_test then
    WriteLn('Error')
  else
    WriteLn('OK');
  cc := UInt32x32To64_purepascal(aa, bb);
  if cc <> cc_test then
    WriteLn('Error')
  else
    WriteLn('OK');

end.

Here is a Python3 code to check:

aa = 2147483647
bb = 2148736590
cc = aa * bb
print(cc)
print("Error") if cc != 4614376688735543730 else print("OK!")