1
votes

I'm trying to load/store a memory from/to a char pointer array using the XMM0 128-bit register on a 32-bit operating system.

What I tried is very simple:

int main() {
    char *data = new char[33];
    for (int i = 0; i < 32; i++)
        data[i] = 'a';
    data[32] = 0;
    __asm
    {
        movdqu xmm0,[data]
    }

    delete[] data;
}

The problem is that this doesn't seem to work. The first time I debugged the Win32 application I got:

xmm0 = 0024F8380000000000F818E30055F158

The second time I debugged it I got:

xmm0 = 0043FD6800000000002C18E3008CF158

So there must be something with the line:

movdqu xmm0,[data]

I tried using this instead:

movdqu xmm0,data

but I got the same result.

What I thought was the problem is that I copy the address instead of the data at the address. However the value shown at the xmm0 register is too large for a 32-bit address, so it must be copying memory from another address.

I also tried some other instructions I found on the internet, but with the same result.

Is it the way I'm passing the pointer or am I misunderstanding something about xmm basics?

A valid solution with an explanation will be appreciated.

Even though I found the solution (finally after three hours), I would still like an explanation:

__asm
    {
        push eax
        mov eax,data
        movdqu xmm0,[eax]
        pop eax
    }

Why should I pass the pointer to a 32-bit register?

2
Note that data is a pointer. - Margaret Bloom
Can you please check whether a local variable char data[33]; instead of new/delete with a pointer can be used directly, as in the original post with [data]? I can't debug now, but I think this may work, as I can imagine the compiled source. What is puzzling me at the moment is how that differs from char *data in C++. From the C++ point of view they look equivalent. I'm probably overlooking something. (And in that second version, mov eax,data is compiled to mov eax,[data], right?) - Ped7g
x86 does not have a "memory indirect" addressing mode. You are loading the pointer into xmm0. Since xmm0 is larger than a pointer, you are also reading garbage bytes in memory beyond the end of where the pointer is stored. - Raymond Chen
It works with non-pointers. Yes, it is compiled to mov eax,[data] (Ped7g). Hmm, what is 'data' actually in assembly? Maybe I have mistaken how ASM treats 'data'. I thought 'data' was treated as it is in C++. It seems that it gives the address of the 'data' variable, not the address stored in the 'data' pointer as it would in C++. That explanation looks logical to me. - user2377766
The recommended way of doing this is the _mm_loadu_si128 intrinsic. - Raymond Chen

2 Answers

1
votes
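#include <cstdio>

int main()
{
    char *dataptr = new char[33];   // pointer variable; the buffer lives on the heap
    char datalocal[33];             // the buffer itself, inside the stack frame
    dataptr[0] = 'a';   dataptr[1] = 0;
    datalocal[0] = 'a'; datalocal[1] = 0;
    // %p expects a void *, hence the casts
    std::printf("%p %p %c\n", (void *)dataptr, (void *)&dataptr, dataptr[0]);
    std::printf("%p %p %c\n", (void *)datalocal, (void *)&datalocal, datalocal[0]);
    delete[] dataptr;
}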

Output:

0xd38050 0x7635bd709448 a
0x7635bd709450 0x7635bd709450 a

As we can see, dataptr is really a pointer variable (32 or 64 bits wide, stored here at 0x7635BD709448), containing a pointer to the heap, 0xD38050.

The local variable is directly a 33 characters long buffer, allocated at address 0x7635BD709450.

But datalocal also works as a char * value.

I'm a bit unsure what the formal C++ explanation of this is. While writing C++ code this feels quite natural: dataptr[0] is the first element in the heap memory, and the compiler silently reads the pointer variable first and then the byte it points to. In assembler you see the true nature of the symbol dataptr, which is the address of the pointer variable. So you first have to load the heap pointer with mov eax,[data] (eax now holds 0xD38050), and then you can load the contents at 0xD38050 into XMM0 by using [eax].

With a local array there is no separate variable holding its address; the symbol datalocal is already the address of the first element, so movdqu xmm0,[datalocal] will work then.

In the "wrong" case you can still do movdqu xmm0,[data]; the CPU has no problem loading 128 bits from the address of a 32-bit variable. It will simply continue reading beyond the 32 bits and read another 96 bits belonging to other variables. If the variable happens to sit at the very end of the last mapped memory page of the application, the read will crash with an invalid access.
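To make the "pointer variable versus heap buffer" distinction concrete, here is a minimal sketch in plain C++ (no inline assembly). The helper names are made up for illustration. It copies only the bytes that belong to the pointer variable, never the extra 96 bits, and it assumes a mainstream platform where a pointer's object representation equals its value converted to std::uintptr_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the raw bytes stored in the pointer variable
// itself, i.e. what [data] addresses in the inline assembly.
std::uintptr_t bytes_stored_in_pointer_variable(char *const &ptr_var)
{
    std::uintptr_t raw = 0;
    static_assert(sizeof raw == sizeof ptr_var, "pointer and uintptr_t sizes differ");
    std::memcpy(&raw, &ptr_var, sizeof raw);   // copy the variable's bytes only
    return raw;
}

// Hypothetical demo: the variable holds the heap address, not the characters.
bool pointer_variable_holds_heap_address()
{
    char *dataptr = new char[33];
    dataptr[0] = 'a';

    std::uintptr_t raw = bytes_stored_in_pointer_variable(dataptr);

    // Assumption: on common platforms the stored bytes are the address value.
    bool same = (raw == reinterpret_cast<std::uintptr_t>(dataptr));

    // Following the stored address (the job of mov eax,[data] plus [eax])
    // is what finally reaches the data.
    assert(*reinterpret_cast<char *>(raw) == 'a');

    delete[] dataptr;
    return same;
}
```

Calling pointer_variable_holds_heap_address() should return true on typical x86 builds: the bytes found at the address of the variable are the heap address, which is what ended up in xmm0 in the question.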


Alignment was mentioned a few times in the comments. That's a valid point: movdqu accepts any address, but the aligned form movdqa requires the memory to be 16-byte aligned. Check how your C++ compiler lets you request alignment. For Visual Studio this should work:

__declspec(align(16)) char datalocal[33];
char *dataptr = (char *)_aligned_malloc(33, 16);  // declared in <malloc.h>
_aligned_free(dataptr);
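As a portable counterpart to the Visual Studio specific lines above, the following sketch uses standard alignas together with the SSE2 intrinsics from <emmintrin.h>. The function names are hypothetical. It assumes an x86 or x86-64 target with SSE2 and a C++11 compiler; _mm_load_si128 is the aligned load (typically movdqa) and _mm_loadu_si128 the unaligned one (typically movdqu).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2: __m128i, _mm_load_si128, _mm_loadu_si128

// Hypothetical helper: true if p is a multiple of 16.
bool is_aligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Aligned load: the address handed to _mm_load_si128 must be 16-byte aligned.
char first_byte_aligned_load()
{
    alignas(16) char datalocal[32];
    std::memset(datalocal, 'a', sizeof datalocal);
    assert(is_aligned16(datalocal));

    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i *>(datalocal));
    // The low 32 bits of the register hold bytes 0..3; keep byte 0.
    return static_cast<char>(_mm_cvtsi128_si32(v) & 0xFF);
}

// Unaligned load: deliberately starts one byte past an aligned address.
char first_byte_unaligned_load()
{
    alignas(16) char datalocal[33];
    std::memset(datalocal, 'b', sizeof datalocal);

    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(datalocal + 1));
    return static_cast<char>(_mm_cvtsi128_si32(v) & 0xFF);
}
```

The second function is the reason the unaligned form is the safe default whenever you do not control how the buffer was allocated.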

About my C++ interpretation: Maybe I got this wrong since the beginning.

In C++, dataptr is the value of the dataptr variable, that is, the heap address. Then dataptr[0] dereferences the heap address, accessing the first element of the allocated memory. &dataptr is the address of the dataptr variable. This also makes sense with syntax like dataptr = nullptr;, where you are storing the nullptr value into the dataptr variable, not overwriting the address of the dataptr symbol.

With datalocal[] there's basically no sense in assigning to the bare datalocal, as in datalocal = 'a';, since it's an array variable, so you should always provide the [] index. And &datalocal is the address of that array. The bare datalocal is then a shortcut for easier pointer math with arrays, etc.: it converts to a char * pointing at the first element. Even if the bare datalocal were a syntax error, it would still be possible to write C++ code (using &datalocal for the pointer, datalocal[..] for elements), and it would fit with that dataptr logic completely.
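For what it's worth, the compiler can confirm most of this interpretation. The sketch below is only an illustration with a made-up function name: it checks the types of datalocal, &datalocal and &datalocal[0] at compile time, and compares addresses at run time.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical demo: an array and a pointer to its first element are
// different kinds of objects, even though they refer to the same address.
bool array_and_pointer_differ()
{
    char datalocal[33] = {};
    char *dataptr = datalocal;   // array-to-pointer conversion ("decay")

    // The array keeps its full type and size; only its use in an
    // expression converts it to char *.
    static_assert(std::is_same<decltype(datalocal), char[33]>::value, "array type");
    static_assert(std::is_same<decltype(&datalocal), char (*)[33]>::value, "pointer to array");
    static_assert(std::is_same<decltype(&datalocal[0]), char *>::value, "pointer to element");
    static_assert(sizeof(datalocal) == 33, "the array is the buffer itself");
    assert(sizeof(dataptr) == sizeof(void *));   // the pointer is just an address

    // &datalocal and the decayed pointer name the same address, whereas
    // &dataptr is the address of a separate variable.
    bool same_start = static_cast<void *>(&datalocal) == static_cast<void *>(dataptr);
    bool separate_variable = static_cast<void *>(&dataptr) != static_cast<void *>(dataptr);
    return same_start && separate_variable;
}
```

So datalocal and &datalocal give the same address with different types, while dataptr and &dataptr give two different addresses, which matches the printf output shown earlier.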

Conclusion: You had your example wrong since the beginning, because in assembly language [data] is loading the value of data, which is the pointer to the heap returned by new.

This is my own explanation, and now some C++ expert will come and tear it to pieces from a formal point of view... :)))

3
votes

The problem with your code is that data is a pointer. The instruction movdqu xmm0,[data] loads the 16 bytes at the address of data into register xmm0. This means the 4 or 8 bytes comprising the value of the pointer and whatever bytes follow it in memory. Alignment is not what goes wrong here: movdqu accepts unaligned addresses, and only the aligned form movdqa faults when the address is not a multiple of 16.

The alternative using an automatic array char data[33]; would solve the addressing problem (movdqu would load data from the array). Nothing guarantees that the compiler aligns an array with automatic storage to 16 bytes, so stay with movdqu there; movdqa could raise an access violation.

The solution you found is a good approach. Neither new nor malloc() is guaranteed to return 16-byte-aligned memory in a 32-bit build, but movdqu does not need that alignment.

This should work in all cases:

#include <stdlib.h>

int main(void) {
    char *data = malloc(33);
    for (int i = 0; i < 32; i++) {
        data[i] = 'a';
    }
    data[32] = 0;
    __asm {
        mov    eax,  data
        movdqu xmm0, [eax]
    }
    free(data);
    return 0;
}

As commented by Peter Cordes, it is much better to use intrinsics for this kind of thing, namely _mm_loadu_si128. There are two primary reasons: first, MSVC does not support inline assembly in 64-bit builds, so by using intrinsics your code becomes more portable. Second, the compiler does a relatively poor job of optimizing around inline assembly, and in particular tends to do a lot of pointless memory stores and loads. The compiler does a much better job optimizing intrinsics, which makes your code run faster (which is the whole point of using inline assembly!).
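A minimal sketch of the intrinsic version follows, assuming an x86 or x86-64 target with SSE2 and the standard <emmintrin.h> header. The helper names copy16_via_xmm and demo_load_from_heap are invented for this example. Because data is passed as an ordinary C++ pointer, the compiler handles the "load the pointer first, then load through it" step that the inline assembly had to spell out with eax.

```cpp
#include <cassert>
#include <cstring>
#include <emmintrin.h>   // SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128

// Hypothetical helper: moves 16 bytes from src to dst through an XMM register.
// Both intrinsics are the unaligned forms (typically compiled to movdqu),
// so neither pointer has to be 16-byte aligned.
void copy16_via_xmm(const char *src, char *dst)
{
    assert(src != nullptr && dst != nullptr);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), v);
}

// Hypothetical demo mirroring the question: 32 'a' bytes on the heap.
// Returns true if the 16 bytes that passed through the register match.
bool demo_load_from_heap()
{
    char *data = new char[33];
    std::memset(data, 'a', 32);
    data[32] = 0;

    char out[16];
    copy16_via_xmm(data, out);   // passes the heap address, not &data

    bool ok = std::memcmp(out, data, 16) == 0;
    delete[] data;
    return ok;
}
```

The same source builds for 32-bit and 64-bit targets, and the compiler is free to keep the value in whichever XMM register it likes instead of being forced to use xmm0.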