5
votes

In Visual C++, _umul128 is undefined when targeting Windows 32 bits. How can two unsigned 64 bit integers be multiplied when targeting Win32? The solution only needs to work on Visual C++ 2017 targeting Windows 32 bits.

2
Do you need to do a lot of this? It could possibly be worth using SSE2 or AVX2, depending on what you want to do. Although probably not, because there's no SIMD add-with-carry. But still, see stackoverflow.com/questions/17863411/… for getting a 64-bit result. It would take significantly more instructions to get a 128b result, though. stackoverflow.com/questions/6738283/…Peter Cordes
SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit, possible speedup for 32-bit code vs. using adc + mul (or BMI2 mulx) as a building blocks for large arrays of numbers if you have AVX2 or AVX512, but for one at a time scalar is probably better.Peter Cordes

2 Answers

3
votes

This answer has a version of the xmrrig function from the other answer optimized for MSVC 32-bit mode. The original is fine with other compilers, especially clang.


I looked at MSVC's output for @Augusto's function, and it's really bad. Using __emulu for 32x32 => 64b multiplication improved it significantly (because MSVC is dumb and doesn't optimize uint64_t * uint64_t = uint64_t for the case where the inputs are known to really only be 32-bit, with the upper half zero). Other compilers (gcc and clang) generate a single mul instruction instead of calling a helper function. There are other problems with MSVC's code-gen for this that I don't know how to fix by tweaking the source, though. I think if you want good performance on that compiler, you'll have to use inline asm (or a separately-compiled asm function).

If you need more flexible arbitrary-precision (larger numbers), see GMPlib's low-level functions which have asm implementations, instead of trying to build a 256b multiply out of this __umul128. But if you need this exactly, then it's worth trying. Sticking with C++ enables constant-propagation and CSE optimizations that you wouldn't get with asm.


clang compiles this with no major problems, actually using adc for all the add-with-carry (except one which it saves with a setc instruction). MSVC branches on the carry checks, and just makes nasty code. GCC doesn't do a very good job either, with some branching on carry. (Because gcc doesn't know how to turn carry = sum < a into an adc, gcc bug 79173.)

IDK if MSVC or gcc support any add-with-carry intrinsics for 64-bit integers in 32-bit mode. _addcarry_u64 generates poor code with gcc anyway (in 64-bit mode), but ICC might do ok. IDK about MSVC.

If you want an asm implementation, I'd suggest using clang 5.0's output from this function. You could probably find some optimizations by hand, but it's certainly better than MSVCs. But of course most of the arguments in https://gcc.gnu.org/wiki/DontUseInlineAsm apply: blocking constant-propagation is a major problem if you ever multiply by something that inlining turns into a constant, or by a number with the upper half known to be zero.

Source + asm output for MSVC 32-bit and clang5.0 32-bit on Godbolt

nice code with clang. Kinda bad code with MSVC, but better than before. Kinda bad with gcc also (no change vs. other answer).

#include <stdint.h>

#ifdef _MSC_VER
#  include <intrin.h>
#else
// MSVC doesn't optimize 32x32 => 64b multiplication without its intrinsic
// But good compilers can just use this to get a single mul instruction
static inline
uint64_t __emulu(uint32_t x, uint32_t y) {
     return x * (uint64_t)y;
}
#endif

// This is still pretty ugly with MSVC, branching on the carry
//  and using XMM store / integer reload to zero a register!
// But at least it inlines 4 mul instructions
//  instead of calling a generic 64x64 => 64b multiply helper function
uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = (uint32_t)multiplier; // & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = (uint32_t)multiplicand; // & 0xFFFFFFFF;

    //uint64_t ac = __emulu(a, c);
    uint64_t ad = __emulu(a, d);
    //uint64_t bc = __emulu(b, c);
    uint64_t bd = __emulu(b, d);

    uint64_t adbc = ad + __emulu(b , c);
    uint64_t adbc_carry = (adbc < ad); // ? 1 : 0;
    // MSVC gets confused by the ternary and makes worse code than using a boolean in an integer context for 1 : 0

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = (product_lo < bd); // ? 1 : 0;
    *product_hi = __emulu(a , c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}

Make sure you only use this in 32-bit code. In 64-bit code, it fails to optimize to a single 64-bit mul instruction (which produces both 64-bit halves of the full result). Compilers that implement the GNU C++ extensions (clang, gcc, ICC) can use unsigned __int128 and get good code. e.g. a * (unsigned __int128)b produces a 128b result. (Example on Godbolt).

2
votes

I found the following code (from xmrrig), which seems to do the job just fine:

static inline uint64_t __umul128(uint64_t multiplier, uint64_t multiplicand, 
    uint64_t *product_hi) 
{
    // multiplier   = ab = a * 2^32 + b
    // multiplicand = cd = c * 2^32 + d
    // ab * cd = a * c * 2^64 + (a * d + b * c) * 2^32 + b * d
    uint64_t a = multiplier >> 32;
    uint64_t b = multiplier & 0xFFFFFFFF;
    uint64_t c = multiplicand >> 32;
    uint64_t d = multiplicand & 0xFFFFFFFF;

    //uint64_t ac = a * c;
    uint64_t ad = a * d;
    //uint64_t bc = b * c;
    uint64_t bd = b * d;

    uint64_t adbc = ad + (b * c);
    uint64_t adbc_carry = adbc < ad ? 1 : 0;

    // multiplier * multiplicand = product_hi * 2^64 + product_lo
    uint64_t product_lo = bd + (adbc << 32);
    uint64_t product_lo_carry = product_lo < bd ? 1 : 0;
    *product_hi = (a * c) + (adbc >> 32) + (adbc_carry << 32) + product_lo_carry;

    return product_lo;
}