I'm trying to assemble this code using Keystone and execute it with the Unicorn engine:

start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start

As I understand it, the bl instruction should save the address of the next instruction in the lr register and then jump to start, so this should be an infinite loop that adds 1 to r0 and 2 to r1.

Apparently, I'm wrong, because bl start branches to itself instead!

I'm using Python wrappers for Keystone, Capstone and Unicorn to process the assembly. Here's my code:

import keystone as ks
import capstone as cs
import unicorn as uc

print(f'Keystone {ks.__version__}\nCapstone {cs.__version__}\nUnicorn {uc.__version__}\n')


code = '''
start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start
'''

assembler = ks.Ks(ks.KS_ARCH_ARM, ks.KS_MODE_THUMB)
disassembler = cs.Cs(cs.CS_ARCH_ARM, cs.CS_MODE_THUMB)
emulator = uc.Uc(uc.UC_ARCH_ARM, uc.UC_MODE_THUMB)

machine_code, _ = assembler.asm(code)
machine_code = bytes(machine_code)
print(machine_code.hex())

initial_address = 0
for addr, size, mnem, op_str in disassembler.disasm_lite(machine_code, initial_address):
    instruction = machine_code[addr:addr + size]
    print(f'{addr:04x}|\t{instruction.hex():<8}\t{mnem:<5}\t{op_str}')

emulator.mem_map(initial_address, 1024)  # allocate 1024 bytes of memory
emulator.mem_write(initial_address, machine_code)  # write the machine code
emulator.hook_add(uc.UC_HOOK_CODE, lambda uc, addr, size, _: print(f'Address: {addr}'))
emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)

This is what it outputs:

Keystone 0.9.1
Capstone 5.0.0
Unicorn 1.0.2

00f1010001f10201fff7fefff8e7
0000|   00f10100    add.w   r0, r0, #1
0004|   01f10201    add.w   r1, r1, #2
0008|   fff7feff    bl      #8         ; why not `bl #0`?
000c|   f8e7        b       #0
Address: 0
Address: 4
Address: 8  # OK, we arrived at BL start
Address: 8  # we're at the same instruction again?
Address: 8  # and again?
Address: 8
< ... >
Address: 8
Address: 8
Traceback (most recent call last):
  File "run_ARM_bug.py", line 32, in <module>
    emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/unicorn-1.0.2rc3-py3.7.egg/unicorn/unicorn.py", line 317, in emu_start
unicorn.unicorn.UcError: Emulation timed out (UC_ERR_TIMEOUT)

The exception is not a problem (I set the timeout myself). The problem is that bl start always jumps to itself instead of start.

If I jump forward, however, everything works as expected. Here, bl jumps to the correct address:

start:
    ; stuff
    bl next
    ; hello

next:
    add r0, r0, #1
    bkpt

EDIT

I went on and assembled this code with Clang:

; test.s

.text
.syntax unified
.globl  start       
.p2align    1
.code   16       
.thumb_func
start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start

Used the following commands:

$ clang -c test.s -target armv7-unknown-linux -o test.bin -mthumb
clang-11: warning: unknown platform, assuming -mfloat-abi=soft

And then disassembled test.bin with objdump:

$ objdump -d test.bin

test.bin:       file format elf32-littlearm


Disassembly of section .text:

00000000 <start>:
       0: 00 f1 01 00                   add.w   r0, r0, #1
       4: 01 f1 02 01                   add.w   r1, r1, #2
       8: ff f7 fe ff                   bl      #-4
       c: ff f7 fe bf                   b.w     #-4 <start+0x10>
$ 

So bl's argument is actually an offset. It's negative because we're going backwards. BUT, as the documentation says:

For B, BL, CBNZ, and CBZ instructions, the value of the PC is the address of the current instruction plus 4 bytes.

So bl #-4 will jump to (the address of bl) + 4 bytes - 4 bytes, or, in other words, itself, again!
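
That arithmetic as a minimal sketch (thumb_branch_target is a hypothetical helper name, not part of Keystone, Capstone or Unicorn; it only assumes the PC-plus-4 rule quoted above):

```python
def thumb_branch_target(insn_addr, offset):
    """Target of a Thumb B/BL given the encoded byte offset.

    Assumes the rule quoted from the documentation: the PC reads as the
    address of the current instruction plus 4 bytes.
    """
    pc = insn_addr + 4
    return pc + offset


# The bl sits at address 8 and the disassembly shows an offset of -4.
print(thumb_branch_target(8, -4))   # 8, i.e. the bl itself
# Reaching start (address 0) from the same bl would need an offset of -12.
print(thumb_branch_target(8, -12))  # 0
```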

So, I can't bl backwards for some reason? What's happening here and how to fix it?

did you figure it out? find a different triple with a generic clang that works? - old_timer
@old_timer, oh my, I built a recursive factorial function that uses the same bl instruction with Clang 3.7 (!) on armv7-apple-darwin (old jailbroken iPhone 4), and it worked, but the generated machine code was nothing like the one here. I rebuilt the same assembly for armv7-none-eabi (sadly, I don't have anywhere to run it) with like three different versions of Clang, and all of them generated the bl instruction that, according to objdump, was going to jump to itself. So the only "correct" version was generated by that old Clang for the iPhone. I have no clue why - ForceBru
Also, the assembly in your answer shows f7ff fffa as machine code for bl start, but in my case with Keystone it's swapped like fff7 feff (should've been f7ff fffe, I guess, but that's still fe, not fa). Looks like an endianness issue. BTW, I think Keystone & friends actually use Clang's code to assemble and disassemble, so I'm thinking it's kinda pointless to compare it to "actual" Clang... Anyway, only that old Clang 3.7 generated something that definitely worked. I'm gonna try and summarize my research about the cursed bl instruction in Clang tomorrow and update the question. - ForceBru
@old_timer, here are some of my findings: gist.github.com/ForceBru/e5c487607342cbd853bdb2e31cd5fcd8. Looks like this has to do with the resulting file being an object file or an actual executable. I think I've already seen a question about this on SO... - ForceBru
DOH, didn't think about that. Of course, some toolchains won't resolve until link time. Instructions like these are definitely handled by the linker in all cases, so why not have the linker do it all the time even if the assembler knows the answer. That's probably it, and easy to test. - old_timer

1 Answer

Every toolchain's linker has to resolve function calls and other references to external symbols, so in an object file you will often see an instruction like bl encoded as a branch to self, a branch to zero, or some similarly incomplete placeholder (certainly for external labels). The tangent here is that some versions of clang resolve a local label at assembly time and some do not. Either way, the offset is patched when the object is linked, as it is in this case.

A generic clang 3.7 build (all targets, x86 host by default) emits the right instruction at the object level; 3.8 doesn't, so that appears to be when the change happened. A generic clang 10 doesn't either, but a hand-built clang 10.0.0 specific to one target does give the right answer at assembly time.

All of this is a tangent because it concerns assembly time, not the final output. When linked, you get the right answer (thus far; the OP may have other cases where it didn't).

.thumb
.syntax unified
.thumb_func
start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start

clang-3.8 -c so.s -target armv7-unknown-linux -o so.o
clang: warning: unknown platform, assuming -mfloat-abi=soft
arm-none-eabi-objdump -D so.o

so.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <start>:
   0:   f100 0001   add.w   r0, r0, #1
   4:   f101 0102   add.w   r1, r1, #2
   8:   f7ff fffe   bl  0 <start>
   c:   e7f8        b.n 0 <start>

The bl here is an incomplete placeholder: a branch to self.

But take that object and link it:

arm-none-eabi-ld -Ttext=0 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000000000
arm-none-eabi-objdump -d so.elf

so.elf:     file format elf32-littlearm


Disassembly of section .text:

00000000 <start>:
   0:   f100 0001   add.w   r0, r0, #1
   4:   f101 0102   add.w   r1, r1, #2
   8:   f7ff fffa   bl  0 <start>
   c:   e7f8        b.n 0 <start>

And you get the correct answer.
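
To see why f7ff fffe is a branch to self while f7ff fffa reaches start, here is a rough sketch that decodes the offset from the two halfwords. decode_thumb2_bl is a hypothetical helper written for this answer, not a library function; it assumes the Thumb-2 BL layout (encoding T1: 11110 S imm10, then 11 J1 1 J2 imm11) with I1 = NOT(J1 XOR S) and I2 = NOT(J2 XOR S).

```python
def decode_thumb2_bl(hw1, hw2):
    """Return the signed byte offset encoded in a Thumb-2 BL (encoding T1)."""
    # hw1 = 11110 S imm10, hw2 = 11 J1 1 J2 imm11
    if (hw1 & 0xF800) != 0xF000 or (hw2 & 0xD000) != 0xD000:
        raise ValueError("not a BL (T1) encoding")
    s = (hw1 >> 10) & 1
    imm10 = hw1 & 0x3FF
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF
    i1 = 1 - (j1 ^ s)  # I1 = NOT(J1 XOR S)
    i2 = 1 - (j2 ^ s)  # I2 = NOT(J2 XOR S)
    # 25-bit value S:I1:I2:imm10:imm11:0, then sign-extend using S.
    value = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:
        value -= 1 << 25
    return value


bl_addr = 8
for hw1, hw2 in ((0xF7FF, 0xFFFE), (0xF7FF, 0xFFFA)):
    offset = decode_thumb2_bl(hw1, hw2)
    # The PC reads as the address of the bl plus 4.
    print(f"{hw1:04x} {hw2:04x}: offset {offset}, target {bl_addr + 4 + offset}")
# f7ff fffe: offset -4, target 8
# f7ff fffa: offset -12, target 0
```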

Sorry for the misleading answer before; I was off on a tangent there for a bit.

Now, if linking doesn't fix it for you in all cases, please comment.
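
When there is no link step at all, as in the Keystone-to-Unicorn flow from the question, one option might be to apply the same kind of patch by hand. The following is only a sketch under that assumption: encode_thumb2_bl is a hypothetical helper, not a Keystone API, and it handles nothing but a Thumb-2 BL (encoding T1) to an in-range target.

```python
import struct


def encode_thumb2_bl(insn_addr, target):
    """Encode a Thumb-2 BL (encoding T1) at insn_addr that branches to target."""
    offset = target - (insn_addr + 4)  # the PC reads as insn_addr + 4
    if offset % 2 or not -(1 << 24) <= offset < (1 << 24):
        raise ValueError("target is misaligned or out of range for BL")
    s = 1 if offset < 0 else 0
    u = offset & 0x1FFFFFF  # 25-bit two's complement of the offset
    i1 = (u >> 23) & 1
    i2 = (u >> 22) & 1
    imm10 = (u >> 12) & 0x3FF
    imm11 = (u >> 1) & 0x7FF
    j1 = 1 - (i1 ^ s)  # inverse of I1 = NOT(J1 XOR S)
    j2 = 1 - (i2 ^ s)  # inverse of I2 = NOT(J2 XOR S)
    hw1 = 0xF000 | (s << 10) | imm10
    hw2 = 0xD000 | (j1 << 13) | (j2 << 11) | imm11
    # Each halfword is stored little-endian, first halfword first.
    return struct.pack("<2H", hw1, hw2)


print(encode_thumb2_bl(8, 0).hex())  # fff7faff
print(encode_thumb2_bl(8, 8).hex())  # fff7feff, the branch-to-self placeholder
```

With the question's buffer, something like machine_code[:8] + encode_thumb2_bl(8, 0) + machine_code[12:] should then carry the same bl bytes as the linked so.elf above.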

Another part of the problem here is the tools not helping you:

0008|   fff7feff    bl      #8         ; why not `bl #0`?

8: ff f7 fe ff                   bl      #-4

These are the same instruction. It was formerly a pair of Thumb instructions, 0xF7FF and 0xFFFE, but on ARMv7-A/R it is considered a single, inseparable instruction, 0xF7FFFFFE.

Looking this up again while working on this question, I found the following, which I either knew and forgot or never knew:
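
A quick way to check that the byte string from the question and the halfwords above agree, assuming only that each 16-bit halfword is stored little-endian:

```python
import struct

# Bytes exactly as printed by the Keystone/Capstone listing in the question.
raw = bytes.fromhex("fff7feff")

# Two little-endian 16-bit halfwords, first halfword first.
hw1, hw2 = struct.unpack("<2H", raw)
print(f"{hw1:04x} {hw2:04x}")  # f7ff fffe
```

So fff7feff and f7ff fffe are two views of the same four bytes: one dumps them in memory order, the other prints each halfword as a number.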

Before ARMv6T2, J1 and J2 in encodings T1 and T2 were both 1, resulting in a smaller branch range. The instructions could be executed as two separate 16-bit instructions
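
A rough illustration of the "smaller branch range" remark, under my reading of the encoding: with J1 and J2 fixed at 1, I1 and I2 simply repeat the sign bit, so the offset is effectively a 23-bit signed value instead of a 25-bit one. bl_offset_range is a hypothetical helper, not taken from any manual or library.

```python
def bl_offset_range(j_bits_free):
    """Return (lowest, highest) byte offset a Thumb BL can encode.

    j_bits_free=False models J1 = J2 = 1 (the pre-ARMv6T2 case), where the
    offset is S:imm10:imm11:0, a 23-bit signed value.
    j_bits_free=True models S:I1:I2:imm10:imm11:0, a 25-bit signed value.
    """
    bits = 25 if j_bits_free else 23
    half = 1 << (bits - 1)
    return -half, half - 2  # offsets are always even


print(bl_offset_range(False))  # (-4194304, 4194302), about +/-4 MB
print(bl_offset_range(True))   # (-16777216, 16777214), about +/-16 MB
```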

I have demonstrated that on pre-ARMv7 architectures the two halves execute as separate instructions, showing that they are not one instruction.

Anyway:

It is the same instruction as this one from the GNU tools:

   8:   f7ff fffe   bl  0 <start>

The GNU output is a little better but still has issues: the encoding is not actually bl 0 <start>, but the output indicates the intended target, and the instruction is re-encoded correctly when linked.

So the tools were likely part of the difficulty in understanding what is going on, because they do not present the machine code in a properly decodable form.