1
votes

I'm trying to write a function that reverses order of characters in a string using x86 NASM assembly language. I tried doing it using registers (I know it's more efficient to do it using stack) but I keep getting a segmentation fault, the c declaration looks as following

extern char* reverse(char*);

The assembly segment:

section .text
global reverse
reverse:
        push ebp           ; prologue
        mov ebp, esp       
        mov eax, [ebp+8]   ; eax <- points to string
        mov edx, eax
look_for_last:
        mov ch, [edx]      ; put char from edx in ch
        inc edx
        test ch, ch        
        jnz look_for_last  ; if char != 0 loop
        sub edx, 2         ; found last
swap:                      ; eax = first, edx = last (characters in string)
        test eax, edx       
        jg end             ; if eax > edx reverse is done
        mov cl, [eax]      ; put char from eax in cl
        mov ch, [edx]      ; put char from edx in ch
        mov [edx], cl      ; put cl in edx
        mov [eax], ch      ; put ch in eax
        inc eax
        dec edx
        jmp swap            
end:
        mov eax, [ebp+8]   ; move char pointer to eax (func return)
        pop ebp            ; epilogue
        ret

It seems like the line causing the segmentation fault is

mov cl, [eax]

Why is that happening? In my understanding eax never goes beyond the bounds of the string so there always is something in [eax]. How come I get a segmentation fault?

1
I know it's more efficient to do it using stack No, it isn't! in-place reversal is far more efficient than doing a 4-byte push for every single byte of a string. Your specific implementation is not optimized, but it's still probably faster. (e.g. you don't use bswap to reverse 4 bytes at a time, and on some CPUs (AMD) writing cl and ch separately will have a false dependency on ECX, and your loop structure with a jmp at the bottom and also a conditional branch at the top is inefficient). - Peter Cordes
Also of course you could use SIMD for strlen (which you call look_for_last) to go 16 bytes at a time with SSE2, or at least a 4-byte-at-a-time bithack. Why does glibc's strlen need to be so complicated to run quickly?. It hurts some to have to do that work before you can even start swapping, while a push/pop implementation could do that at the same time. But push/pop also needs two loops, and both of them copy and touch 5x (= 4+1) as much memory as your read-only and in-place loops. You could even skip the strlen if you were reversing into a large buf - Peter Cordes
Anyway, with SSSE3 for pshufb you can reverse 16 bytes per clock cycle on modern x86 CPUs, vs. 1 for yours. I don't think AVX2 would help until maybe IceLake; other CPUs would bottleneck on shuffle throughput. AVX512VBMI vpermb on IceLake would be good, though: single uop for for 64 bytes at a time. See also gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 - GCC missed optimizations in auto-vectorizing a word (aka short) reverse loop. - Peter Cordes
(I know that's way more optimization than you're probably ready to digest yet, but some of it might be interesting. See also stackoverflow.com/tags/x86/info for other guides, docs, and performance links.) - Peter Cordes

1 Answers

1
votes

Ok I figured it out, I mistakenly used test eax, edx instead of which I should have used cmp eax, edx. It works now.