24
votes

I'm still working on routines for arbitrary long integers in C++. So far, I have implemented addition/subtraction and multiplication for 64-bit Intel CPUs.

Everything works fine, but I wondered if I can speed it a bit by using SSE. I browsed through the SSE docs and processor instruction lists, but I could not find anything I think I can use and here is why:

  • SSE has some integer instructions, but most instructions handle floating point. It doesn't look like it was designed for use with integers (e.g. is there an integer compare for less?)

  • The SSE idea is SIMD (same instruction, multiple data), so it provides instructions for 2 or 4 independent operations. I, on the other hand, would like to have something like a 128 bit integer add (128 bit input and output). This doesn't seem to exist. (Yet? In AVX2 maybe?)

  • The integer additions and subtractions handle neither input nor output carries. So it's very cumbersome (and thus, slow) to do it by hand.

My question is: is my assessment correct or is there anything I have overlooked? Can long integer routines benefit from SSE? In particular, can they help me to write a quicker add, sub or mul routine?

1

1 Answers

32
votes

In the past, the answer to this question was a solid, "no". But as of 2017, the situation is changing.

But before I continue, time for some background terminology:

  1. Full Word Arithmetic
  2. Partial Word Arithmetic


Full-Word Arithmetic:

This is the standard representation where the number is stored in base 232 or 264 using an array of 32-bit or 64-bit integers. Many bignum libraries and applications (including GMP) use this representation.

In full-word representation, every integer has a unique representation. Operations like comparisons are easy. But stuff like addition are more difficult because of the need for carry-propagation.

It is this carry-propagation that makes bignum arithmetic almost impossible to vectorize.


Partial-Word Arithmetic

This is a lesser-used representation where the number uses a base less than the hardware word-size. For example, putting only 60 bits in each 64-bit word. Or using base 1,000,000,000 with a 32-bit word-size for decimal arithmetic.

The authors of GMP call this, "nails" where the "nail" is the unused portion of the word.

In the past, use of partial-word arithmetic was mostly restricted to applications working in non-binary bases. But nowadays, it's becoming more important in that it allows carry-propagation to be delayed.


Problems with Full-Word Arithmetic:

Vectorizing full-word arithmetic has historically been a lost cause:

  1. SSE/AVX2 has no support for carry-propagation.
  2. SSE/AVX2 has no 128-bit add/sub.
  3. SSE/AVX2 has no 64 x 64-bit integer multiply.*

*AVX512-DQ adds a lower-half 64x64-bit multiply. But there is still no upper-half instruction.

Furthermore, x86/x64 has plenty of specialized scalar instructions for bignums:

  • Add-with-Carry: adc, adcx, adox.
  • Double-word Multiply: Single-operand mul and mulx.

In light of this, both bignum-add and bignum-multiply are difficult for SIMD to beat scalar on x64. Definitely not with SSE or AVX.

With AVX2, SIMD is almost competitive with scalar bignum-multiply if you rearrange the data to enable "vertical vectorization" of 4 different (and independent) multiplies of the same lengths in each of the 4 SIMD lanes.

AVX512 will tip things more in favor of SIMD again assuming vertical vectorization.

But for the most part, "horizontal vectorization" of bignums is largely still a lost cause unless you have many of them (of the same size) and can afford the cost of transposing them to make them "vertical".


Vectorization of Partial-Word Arithmetic

With partial-word arithmetic, the extra "nail" bits enable you to delay carry-propagation.

So as long as you as you don't overflow the word, SIMD add/sub can be done directly. In many implementations, partial-word representation uses signed integers to allow words to go negative.

Because there is (usually) no need to perform carryout, SIMD add/sub on partial words can be done equally efficiently on both vertically and horizontally-vectorized bignums.

Carryout on horizontally-vectorized bignums is still cheap as you merely shift the nails over the next lane. A full carryout to completely clear the nail bits and get to a unique representation usually isn't necessary unless you need to do a comparison of two numbers that are almost the same.

Multiplication is more complicated with partial-word arithmetic since you need to deal with the nail bits. But as with add/sub, it is nevertheless possible to do it efficiently on horizontally-vectorized bignums.

AVX512-IFMA (coming with Cannonlake processors) will have instructions that give the full 104 bits of a 52 x 52-bit multiply (presumably using the FPU hardware). This will play very well with partial-word representations that use 52 bits per word.


Large Multiplication using FFTs

For really large bignums, multiplication is most efficiently done using Fast-Fourier Transforms (FFTs).

FFTs are completely vectorizable since they work on independent doubles. This is possible because fundamentally, the representation that FFTs use is a partial word representation.


To summarize, vectorization of bignum arithmetic is possible. But sacrifices must be made.

If you expect SSE/AVX to be able to speed up some existing bignum code without fundamental changes to the representation and/or data layout, that's not likely to happen.

But nevertheless, bignum arithmetic is possible to vectorize.


Disclosure:

I'm the author of y-cruncher which does plenty of large number arithmetic.