I have written an x86 assembly routine (x87 FPU) that normalizes a vector of three floats. Here it is:
_asm_normalize10:; Function begin
push ebp ; 002E _ 55
mov ebp, esp ; 002F _ 89. E5
mov eax, dword [ebp+8H] ; 0031 _ 8B. 45, 08
fld dword [eax] ; 0034 _ D9. 00
fmul st0, st(0) ; 0036 _ DC. C8
fld dword [eax+4H] ; 0038 _ D9. 40, 04
fmul st0, st(0) ; 003B _ DC. C8
fld dword [eax+8H] ; 003D _ D9. 40, 08
fmul st0, st(0) ; 0040 _ DC. C8
faddp st1, st(0) ; 0042 _ DE. C1
faddp st1, st(0) ; 0044 _ DE. C1
fsqrt ; 0046 _ D9. FA
fld1 ; 0048 _ D9. E8
fdivrp st1, st(0) ; 004A _ DE. F1
fld dword [eax] ; 004C _ D9. 00
fmul st(0), st1 ; 004E _ D8. C9
fstp dword [eax] ; 0050 _ D9. 18
fld dword [eax+4H] ; 0052 _ D9. 40, 04
fmul st(0), st1 ; 0055 _ D8. C9
fstp dword [eax+4H] ; 0057 _ D9. 58, 04
fld dword [eax+8H] ; 005A _ D9. 40, 08
fmulp st1, st(0) ; 005D _ DE. C9
fstp dword [eax+8H] ; 005F _ D9. 58, 08
pop ebp ; 0062 _ 5D
ret ; 0063 _ C3
; _asm_normalize10 End of function
[This is my own code ;-) It works, and I have tested it.]
I do not know x86 assembly very well, and I would like to optimize the routine above. I want to stay with plain old x87 FPU code (no SSE), just somewhat better optimized than what I have now.
In particular, I wonder whether there is something clumsy in the code above: I load x, y and z onto the FPU stack, compute 1/sqrt(x*x+y*y+z*z), then load x, y and z from RAM a second time, multiply each by that value, and store the results.
Is this suboptimal? Should I instead load x, y and z only once (not twice), keep them on the FPU stack during the computation, and store them at the end?