I'm trying to write a function that reverses order of characters in a string using x86 NASM assembly language. I tried doing it using registers (I know it's more efficient to do it using stack) but I keep getting a segmentation fault, the c declaration looks as following
extern char* reverse(char*);
The assembly segment:
section .text
global reverse
reverse:
push ebp ; prologue
mov ebp, esp
mov eax, [ebp+8] ; eax <- points to string
mov edx, eax
look_for_last:
mov ch, [edx] ; put char from edx in ch
inc edx
test ch, ch
jnz look_for_last ; if char != 0 loop
sub edx, 2 ; found last
swap: ; eax = first, edx = last (characters in string)
test eax, edx
jg end ; if eax > edx reverse is done
mov cl, [eax] ; put char from eax in cl
mov ch, [edx] ; put char from edx in ch
mov [edx], cl ; put cl in edx
mov [eax], ch ; put ch in eax
inc eax
dec edx
jmp swap
end:
mov eax, [ebp+8] ; move char pointer to eax (func return)
pop ebp ; epilogue
ret
It seems like the line causing the segmentation fault is
mov cl, [eax]
Why is that happening? In my understanding eax never goes beyond the bounds of the string so there always is something in [eax]. How come I get a segmentation fault?
pushfor every single byte of a string. Your specific implementation is not optimized, but it's still probably faster. (e.g. you don't usebswapto reverse 4 bytes at a time, and on some CPUs (AMD) writing cl and ch separately will have a false dependency on ECX, and your loop structure with ajmpat the bottom and also a conditional branch at the top is inefficient). - Peter Cordesstrlen(which you calllook_for_last) to go 16 bytes at a time with SSE2, or at least a 4-byte-at-a-time bithack. Why does glibc's strlen need to be so complicated to run quickly?. It hurts some to have to do that work before you can even start swapping, while a push/pop implementation could do that at the same time. But push/pop also needs two loops, and both of them copy and touch 5x (= 4+1) as much memory as your read-only and in-place loops. You could even skip the strlen if you were reversing into a large buf - Peter Cordespshufbyou can reverse 16 bytes per clock cycle on modern x86 CPUs, vs. 1 for yours. I don't think AVX2 would help until maybe IceLake; other CPUs would bottleneck on shuffle throughput. AVX512VBMIvpermbon IceLake would be good, though: single uop for for 64 bytes at a time. See also gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 - GCC missed optimizations in auto-vectorizing a word (akashort) reverse loop. - Peter Cordes