
I want to do some operations using Intel intrinsics on vectors of 16-bit unsigned ints, specifically the following:

Load or set from an array of unsigned short int.

Division and modulo by an unsigned short int.

Multiplication by an unsigned short int.

Store of unsigned short int into an array.

I looked into the Intrinsics Guide, but it looks like there are only intrinsics for signed short integers, not unsigned ones. Does anyone have a trick that could help me with this?

In fact I'm trying to store an image of a specific raster format in an array with a specific ordering, so I have to calculate the index where each pixel value will be stored:

unsigned int Index(unsigned int interleaving_depth, unsigned int x_size, unsigned int y_size, unsigned int z_size, unsigned int Pixel_number)
{
    unsigned int x = 0, y = 0, z = 0, remainder = 0, i = 0;

    y = Pixel_number / (x_size*z_size);
    remainder = Pixel_number % (x_size*z_size);

    i = remainder / (x_size*interleaving_depth);
    remainder = remainder % (x_size*interleaving_depth);

    if (i == z_size/interleaving_depth) {
        x = remainder / (z_size - i*interleaving_depth);
        remainder = remainder % (z_size - i*interleaving_depth);
    }
    else {
        x = remainder / interleaving_depth;
        remainder = remainder % interleaving_depth;
    }

    z = interleaving_depth*i + remainder;
    if (z >= z_size)
        z = z_size - 1;

    return x + y*x_size + z*x_size*y_size;  /* was "+ *x_size*y_size", which doesn't compile */
}
If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned. So you can use github.com/HJLebbink/asm-dude/wiki/PMULLW on either. There are separate high-half multiply instructions for signed and unsigned short. – Peter Cordes

Of the operations listed here, only division and modulo are affected by signedness; the rest are equivalent in two's complement and unsigned representations. Also, I don't believe there are any SIMD integer division/modulo instructions. What is the specific issue with the algorithm you are attempting to implement? – doynax

If you are using intrinsics for performance, first check what kind of assembly normal C code produces at -O3. It might be that the compiler is already smart enough to do what you are trying to. – hyde

@chux: Not in 16 bits; only the upper half of the full 32-bit result is affected. – doynax

@A.nechi: If so, then I would suggest iterating through the coordinate axes of the source pixel directly instead of reversing the coordinates from a flat index. A step along each axis should then correspond to a fixed pitch addition, plus a bit of special-case handling for the clamping. – doynax

1 Answer


If you only want the low half of the result, multiplication is the same binary operation for signed or unsigned. So you can use pmullw on either. There are separate high-half multiply instructions for signed and unsigned short, though: _mm_mulhi_epu16 (pmulhuw) vs. _mm_mulhi_epi16 (pmulhw).

Similarly, you don't need an _mm_set_epu16 because it's the same operation: on x86 casting to signed doesn't change the bit-pattern, so Intel only bothered to provide _mm_set_epi16, but you can use it with args like 0xFFFFu instead of -1 with no problems. (Using Intel intrinsics automatically means your code only has to be portable to x86 32 and 64 bit.)

Load / store intrinsics don't change the data at all.


SSE/AVX doesn't have integer division or mod instructions. If you have compile-time-constant divisors, do it yourself with a multiply/shift. You can look at compiler output to get the magic constant and shift counts (Why does GCC use multiplication by a strange number in implementing integer division?), or even let gcc auto-vectorize something for you. Or even use GNU C native vector syntax to divide:

#include <immintrin.h>

__m128i div13_epu16(__m128i a) 
{
    typedef unsigned short __attribute__((vector_size(16))) v8uw;
    v8uw tmp = (v8uw)a;
    v8uw divisor = (v8uw)_mm_set1_epi16(13);
    v8uw result = tmp/divisor;
    return (__m128i)result;

    // clang allows "lax" vector type conversions without casts
    // gcc allows vector / scalar, e.g. tmp / 13.  Clang requires set1

    // to work with both, we need to jump through all the syntax hoops
}

compiles to this asm with gcc and clang (Godbolt compiler explorer):

div13_epu16:
    pmulhuw xmm0, XMMWORD PTR .LC0[rip]
    psrlw   xmm0, 2
    ret

.section .rodata
.LC0:
    .value  20165
    # repeats 8 times

If you have runtime-variable divisors, it's going to be slower, but you can use libdivide (http://libdivide.com/). It's not too bad if you reuse the same divisor repeatedly, so you only pay to calculate a fixed-point inverse for it once. But code that works with an arbitrary inverse needs a variable shift count, which is less efficient with SSE (and for scalar integer code too), and potentially more instructions, because some divisors require a more complicated sequence than others.