The obvious scalar way, like you'd get from a compiler (http://godbolt.org/):
movzx eax, byte [mem] ; zero extend. Use movsx to sign-extend
cvtsi2ss xmm0, eax
This costs 3 total uops on Sandybridge-family. (cvtsi2ss is 2).
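For reference, C source along these lines is the kind of thing that typically compiles to the movzx (or movsx) + cvtsi2ss pair above with optimization enabled on x86-64. This is only a sketch: the function names are made up for illustration, and the exact instructions you get depend on the compiler and flags.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned byte -> float.
   The byte is zero-extended to int first (movzx), then converted. */
float byte_to_float(const uint8_t *p)
{
    return (float)*p;
}

/* Hypothetical helper: signed byte -> float.
   Same idea, but the byte is sign-extended (movsx). */
float sbyte_to_float(const int8_t *p)
{
    return (float)*p;
}
```

Pasting this into the compiler explorer is an easy way to check what your own compiler does with it.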
Note that cvtsi2ss is poorly designed: it merges into the old value of XMM0, so it has a false dependency on that register. gcc tends to pxor xmm0,xmm0 first to break the dependency, but if XMM0 wasn't recently used then you should be fine. With AVX, you can zero one XMM register and then repeatedly use it as a safe no-dependency source for multiple converts.
vxorps xmm0, xmm0, xmm0
;then repeated multiple times:
vcvtsi2ss xmm1, xmm0, eax ; xmm1 is write-only, no false dep
If SSE4.1 is available, and it's ok to read 3 bytes past the byte you want (without segfaulting from reading an unmapped page, and without perf problems from cache-line or page splits), then you could do this:
pmovzxbd xmm0, dword [mem] ; byte->dword packed zero extend
cvtdq2ps xmm1, xmm0 ; packed-convert of int32 to float
This costs 2 total uops on SnB-family: pmovzx/sx with an XMM destination can micro-fuse a load. (But not the AVX2 YMM version; see Agner Fog's instruction tables at http://agner.org/optimize/.)
Of course this is excellent if you actually want to convert 4 consecutive bytes. Otherwise you might shuffle to set up for a cvt instruction if you have multiple conversions.
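If you're writing C rather than asm, the packed version can be sketched with intrinsics roughly like this. The function name bytes4_to_floats is made up for illustration, and the sketch assumes GCC or clang (for the target attribute) and a CPU with SSE4.1. It loads exactly 4 bytes with memcpy, so the compiler may emit a movd followed by a register-source pmovzxbd instead of the single memory-source pmovzxbd shown above. On non-x86 targets it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>   /* SSE4.1 intrinsics */

/* Hypothetical helper: convert 4 consecutive unsigned bytes to 4 floats. */
__attribute__((target("sse4.1")))
void bytes4_to_floats(const uint8_t *src, float *dst)
{
    assert(src != NULL && dst != NULL);

    int32_t packed;
    memcpy(&packed, src, sizeof packed);         /* read exactly 4 bytes */

    __m128i bytes  = _mm_cvtsi32_si128(packed);  /* movd */
    __m128i dwords = _mm_cvtepu8_epi32(bytes);   /* pmovzxbd: zero-extend 4 bytes */
    __m128  floats = _mm_cvtepi32_ps(dwords);    /* cvtdq2ps: int32 -> float */

    _mm_storeu_ps(dst, floats);                  /* unaligned store of 4 floats */
}
#else
/* Scalar fallback so the sketch still builds on non-x86 targets. */
void bytes4_to_floats(const uint8_t *src, float *dst)
{
    assert(src != NULL && dst != NULL);
    for (int i = 0; i < 4; i++)
        dst[i] = (float)src[i];
}
#endif
```

For signed bytes, _mm_cvtepi8_epi32 (pmovsxbd) would take the place of _mm_cvtepu8_epi32.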
cvtdq2ps to convert the 32 bit int to a float. - Paul R