4
votes

I want to multiply the data stored in one XMM register by a single float value and save the result in an XMM register. I made a little graphic to explain it a bit better.

As you can see, I have an xmm0 register with my data in it. For example, it contains:

xmm0 = |4.0|2.5|3.5|2.0|

Each float is stored in 4 bytes, so the 128-bit (16-byte) xmm0 register holds four of them.

That works pretty well. Now I want to store 0.5 in another XMM register, e.g. xmm1, and multiply it with xmm0 so that each value stored in xmm0 is multiplied by 0.5.

I have absolutely no idea how to store 0.5 in an XMM register. Any suggestions?

By the way, this is inline assembly in C++.

void filter(image* src_image, image* dst_image)
{
    float* src = src_image->data;
    float* dst = dst_image->data;

    __asm__ __volatile__ (              
        "movaps (%%esi), %%xmm0\n"      
        // Multiply %xmm0 with a float, e.g. 0.5
        "movaps %%xmm0, (%%edi)\n" 

        :
        : "S"(src), "D"(dst) :  
    );
}

This is a quite simple version of what I want to do. I have some image data stored in float arrays, and the pointers to these arrays are passed to the assembly. movaps takes the first 4 float values of the source array and stores these 16 bytes in the xmm0 register. After this, xmm0 should be multiplied by e.g. 0.5. Then the "new" values should be stored in the array that edi points to.

5
It's better to use intrinsics nowadays. That way your code is compiler independent and you get automatic register allocation. - Axel Gneiting

5 Answers

8
votes

As people noted in comments, for this sort of very simple operation, it's essentially always better to use intrinsics:

void filter(image* src_image, image* dst_image)
{
    const __m128 data = _mm_load_ps(src_image->data);
    const __m128 scaled = _mm_mul_ps(data, _mm_set1_ps(0.5f));
    _mm_store_ps(dst_image->data, scaled);
}

You should only resort to inline ASM if the compiler is generating bad code (and only after filing a bug with the compiler vendor).

If you really want to stay in assembly, there are many ways to accomplish this task. You could define a scale vector outside of the ASM block:

    const __m128 half = _mm_set1_ps(0.5f);

and then use it inside the ASM just like you use other operands.
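As a rough sketch of that approach, assuming GCC or Clang extended asm on x86 with SSE enabled and 16-byte-aligned pointers (the function name scale4_by_half is made up for illustration), the vector can be handed to the ASM block through an "x" (SSE register) constraint:

```cpp
#include <xmmintrin.h>

// Hypothetical helper: multiplies four 16-byte-aligned floats by 0.5.
// Assumes GCC/Clang extended asm syntax on an x86 target with SSE.
void scale4_by_half(const float* src, float* dst)
{
    const __m128 half = _mm_set1_ps(0.5f);  // |0.5|0.5|0.5|0.5|
    __m128 data = _mm_load_ps(src);         // aligned load of 4 floats

    // "+x": data is read and written in an SSE register the compiler picks.
    // "x":  half is a read-only input in another SSE register.
    __asm__ ("mulps %1, %0" : "+x"(data) : "x"(half));

    _mm_store_ps(dst, data);                // aligned store of 4 floats
}
```

Letting the compiler choose the registers this way avoids hard-coding xmm0/xmm1 and the clobber lists that would otherwise be needed.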

You can do it without any loads, if you really want to:

    "mov    $0x3f000000, %%eax\n"  // encoding of 0.5
    "movd   %%eax,       %%xmm1\n" // move to xmm1
    "shufps $0, %%xmm1,  %%xmm1\n" // splat across all lanes of xmm1
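To see where the constant 0x3f000000 comes from, here is a small portable check. It assumes float is a 32-bit IEEE-754 single, which holds on x86; the helper name float_bits is made up for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the raw bit pattern of a float.
// Assumes float is a 32-bit IEEE-754 single (true on x86).
inline std::uint32_t float_bits(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "float must be 32 bits wide");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to reinterpret
    return bits;
}
```

With this, float_bits(0.5f) gives 0x3f000000: sign 0, biased exponent 126 (that is 2^-1), and a zero mantissa.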

Those are just two approaches. There are lots of other ways. You might spend some quality time with the Intel Instruction Set Reference.

4
votes

Assuming you're using intrinsics: __m128 halfx4 = _mm_set1_ps(0.5f);

Edit:

You're much better off using intrinsics:

__m128 x = _mm_mul_ps(_mm_load_ps(src), halfx4);
_mm_store_ps(dst, x);

If the src and dst float data are not 16-byte aligned, you need _mm_loadu_ps and _mm_storeu_ps instead, which can be slower, particularly on older CPUs.
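For illustration, here is a sketch of the unaligned variant applied to a whole array rather than just the first four floats. The function name scale_unaligned and its count and factor parameters are assumptions made for this example, not part of the question:

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: dst[i] = src[i] * factor for i in [0, count).
// Uses unaligned loads/stores, so src and dst may have any alignment.
void scale_unaligned(const float* src, float* dst,
                     std::size_t count, float factor)
{
    const __m128 factor4 = _mm_set1_ps(factor);  // factor in all 4 lanes

    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        const __m128 v = _mm_loadu_ps(src + i);          // unaligned load
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, factor4));  // unaligned store
    }
    for (; i < count; ++i) {
        dst[i] = src[i] * factor;  // scalar tail for the last 0..3 elements
    }
}
```

If the data is known to be 16-byte aligned, swapping in _mm_load_ps and _mm_store_ps gives the aligned version.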

2
votes

You are looking for the MOVSS instruction (which loads a single precision float from memory into the lowest 4 bytes of an SSE register), followed by a shuffle to fill the other 3 floats with this value:

movss  (whatever), %%xmm1
shufps $0, %%xmm1, %%xmm1

That's also likely how the _mm_set1_ps intrinsic does it. Then you can just multiply these SSE values or do whatever you want:

mulps %%xmm1, %%xmm0
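Put together with the load and store from the question, this could look roughly like the following. It is a sketch that assumes GCC or Clang extended asm on x86 and 16-byte-aligned pointers; the function name filter4 and the factor parameter are made up for illustration:

```cpp
// Hypothetical helper: dst[0..3] = src[0..3] * factor, all in inline asm.
// Assumes GCC/Clang on x86 or x86-64 and 16-byte-aligned src and dst.
void filter4(const float* src, float* dst, float factor)
{
    __asm__ __volatile__ (
        "movss  %2, %%xmm1\n\t"          // low lane of xmm1 = factor, rest 0
        "shufps $0, %%xmm1, %%xmm1\n\t"  // copy the low lane into all 4 lanes
        "movaps (%0), %%xmm0\n\t"        // load 4 floats from src
        "mulps  %%xmm1, %%xmm0\n\t"      // xmm0 *= xmm1, lane by lane
        "movaps %%xmm0, (%1)\n\t"        // store 4 floats to dst
        :
        : "r"(src), "r"(dst), "m"(factor)
        : "xmm0", "xmm1", "memory"       // registers and memory we modify
    );
}
```

The clobber list tells the compiler that xmm0, xmm1 and memory are changed behind its back, which the snippet in the question leaves out.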

0
votes

If you are using C++ with GCC and have EasySSE, your code can be as follows:

void filter(float* src_image, float* dst_image){
    *(PackedFloat128*)dst_image = PackedFloat128(0.5) * (src_image+0);
}

This assumes the given pointers are 16-byte aligned. You can check the generated assembly code to verify that the variables are properly mapped to vector registers.

0
votes

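Here's one way to move a float into an XMM register and back out with inline assembly (this example only copies a single value; it does not do the multiplication):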

#include <stdio.h>
#include <stdlib.h>

typedef struct img {
    float *data;
} image_t;

image_t *src_image;
image_t *dst_image;
void filter(image_t*, image_t*);

int main()
{
    image_t src, dst;
    src.data = malloc(64);
    dst.data = malloc(64);
    src_image=&src;
    dst_image=&dst;

    *src.data = 42.0;
    filter(src_image, dst_image);

    printf("%f\n", *dst.data);
    free(src.data);
    free(dst.data);
    return 0;
}

void filter(image_t* src_image, image_t* dst_image)
{
    float* src = src_image->data;
    float* dst = dst_image->data;

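    __asm__ __volatile__ (
        "movd   %%esi, %%xmm0;"   // copy the float's bits from esi into xmm0
        "movd   %%xmm0, %%edi;"   // copy them back out into edi
        : "=D" (*dst)
        : "S" (*src)
        : "xmm0"                  // xmm0 is overwritten
    );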
}