1
votes

How can I retrieve a single byte from an address in memory and move its value as a float into an xmm register? (E.g., if there is a byte 123 at the address, I want to be able to do floating-point arithmetic on this value, 123+5 etc., using SSE instructions.)

I am new to assembly, and I hope that the question makes sense. I have tried several things rather randomly (such as moving to al first and to xmm from there, but I don't know how to proceed with the conversion to float); maybe someone can point me in the right direction?

1
You need to unpack the 8 bit value to 32 bits, then use e.g. cvtdq2ps to convert the 32 bit int to a float. - Paul R
Check the documentation, especially all instructions marked as “Convert.” - fuz
@fuz Thanks for the great link - why couldn't I find this by myself?? - Duke
@Duke I dunno. Work on your Google fu! - fuz

1 Answer

4
votes

The obvious scalar way, like you'd get from a compiler (http://godbolt.org/):

movzx     eax,  byte [mem]         ; zero extend.  Use movsx to sign-extend
cvtsi2ss  xmm0, eax

This costs 3 total uops on Sandybridge-family: 1 for the movzx load and 2 for cvtsi2ss.
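
For reference, C source along the following lines is the kind of input a compiler typically turns into the movzx / cvtsi2ss pair above. This is only a minimal sketch: the function names byte_to_float and byte_plus_five are made up for illustration, and the exact instructions emitted depend on the compiler, target and flags, so check the output on godbolt rather than taking it for granted.

```c
#include <stdint.h>

/* Hypothetical helper: load one unsigned byte and convert it to float.
   On x86-64, compilers typically emit a zero-extending byte load
   followed by cvtsi2ss for this, but that is not guaranteed. */
float byte_to_float(const uint8_t *p)
{
    return (float)*p;   /* with int8_t instead, the load would sign-extend (movsx) */
}

/* The question's example: a byte holding 123, plus 5, computed in float. */
float byte_plus_five(const uint8_t *p)
{
    return byte_to_float(p) + 5.0f;
}
```

Every value from 0 to 255 is exactly representable as a float, so the conversion itself never rounds.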

Note that cvtsi2ss is poorly designed and merges into the old value of XMM0, so it has a false dependency on the destination register. gcc tends to pxor xmm0,xmm0 first to break the dependency, but if XMM0 wasn't recently used then you should be fine. With AVX, you can zero one XMM register and then repeatedly use it as a safe no-dependency source for multiple converts.

vxorps   xmm0, xmm0, xmm0

; then repeated multiple times:
vcvtsi2ss  xmm1, xmm0, eax       ; xmm1 is write-only, no false dep

If SSE4.1 is available, and it's ok to read 3 bytes past the byte you want (without segfaulting from reading an unmapped page, and without perf problems from cache-line or page splits), then you could do this:

pmovzxbd    xmm0,  dword [mem]       ; byte->dword packed zero extend
cvtdq2ps    xmm1,  xmm0              ; packed-convert of int32 to float

This costs 2 total uops on SnB-family: pmovzx/sx with an XMM destination can micro-fuse a load (the AVX2 YMM version cannot). See the instruction tables at http://agner.org/optimize/.

Of course this is excellent if you actually want to convert 4 consecutive bytes. Otherwise, if you have multiple conversions to do, you might shuffle the bytes into place to set up for a packed cvt instruction.
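
The 4-consecutive-bytes case can be sketched in C with intrinsics that correspond to the pmovzxbd / cvtdq2ps sequence above. The function name bytes4_to_floats is hypothetical, the SSE4.1 path is only compiled when the compiler defines __SSE4_1__ (e.g. with -msse4.1), and a plain loop is used as a fallback otherwise. Whether the compiler folds the 4-byte load into pmovzxbd's memory operand is up to the compiler.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1: _mm_cvtepu8_epi32 (pmovzxbd) */
#endif

/* Hypothetical helper: convert 4 consecutive unsigned bytes to 4 floats.
   Reads exactly 4 bytes from src and writes 4 floats to dst. */
void bytes4_to_floats(const uint8_t *src, float *dst)
{
#if defined(__SSE4_1__)
    int32_t packed;
    memcpy(&packed, src, sizeof packed);           /* 4-byte load, safe for any alignment */
    __m128i bytes  = _mm_cvtsi32_si128(packed);    /* 4 bytes in the low dword */
    __m128i dwords = _mm_cvtepu8_epi32(bytes);     /* pmovzxbd: byte -> dword, zero-extended */
    _mm_storeu_ps(dst, _mm_cvtepi32_ps(dwords));   /* cvtdq2ps, then unaligned store */
#else
    /* Portable fallback when SSE4.1 is not enabled at compile time. */
    for (int i = 0; i < 4; i++)
        dst[i] = (float)src[i];
#endif
}
```

If only the first byte is wanted, element 0 of the result holds it, but the caveat about reading 3 bytes past it still applies.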