I'm trying to assemble this code using Keystone and execute it with the Unicorn engine:
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
In my opinion, the bl instruction should save the address of the next instruction to the lr register and then jump to start. So it'll be an infinite loop that adds 1 to r0 and 2 to r1.
Apparently, I'm wrong, because bl start branches to itself instead!
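To make the expectation concrete, here is a toy model of the control flow I had in mind. It is only a sketch, not an emulator: expected_trace is a hypothetical helper, the addresses assume each instruction is 4 bytes long (as the disassembly further down shows), and lr is assumed to receive the return address with the Thumb bit set.

```python
def expected_trace(steps):
    """Model the control flow the question expects from the 4-instruction loop."""
    regs = {"r0": 0, "r1": 0, "lr": 0}
    pc = 0
    trace = []
    for _ in range(steps):
        trace.append(pc)
        if pc == 0:        # add r0, r0, #1
            regs["r0"] += 1
            pc = 4
        elif pc == 4:      # add r1, r1, #2
            regs["r1"] += 2
            pc = 8
        elif pc == 8:      # bl start: save return address (Thumb bit set), jump to 0
            regs["lr"] = (pc + 4) | 1
            pc = 0
    return trace, regs


trace, regs = expected_trace(6)
print(trace)  # [0, 4, 8, 0, 4, 8]
print(regs)   # {'r0': 2, 'r1': 4, 'lr': 13}
```

Under this model, b start at address 0xc is never reached, and the hook should report addresses 0, 4, 8 in a cycle.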
I'm using Python wrappers for Keystone, Capstone and Unicorn to process the assembly. Here's my code:
import keystone as ks
import capstone as cs
import unicorn as uc
print(f'Keystone {ks.__version__}\nCapstone {cs.__version__}\nUnicorn {uc.__version__}\n')
code = '''
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
'''
assembler = ks.Ks(ks.KS_ARCH_ARM, ks.KS_MODE_THUMB)
disassembler = cs.Cs(cs.CS_ARCH_ARM, cs.CS_MODE_THUMB)
emulator = uc.Uc(uc.UC_ARCH_ARM, uc.UC_MODE_THUMB)
machine_code, _ = assembler.asm(code)
machine_code = bytes(machine_code)
print(machine_code.hex())
initial_address = 0
for addr, size, mnem, op_str in disassembler.disasm_lite(machine_code, initial_address):
    instruction = machine_code[addr:addr + size]
    print(f'{addr:04x}|\t{instruction.hex():<8}\t{mnem:<5}\t{op_str}')
emulator.mem_map(initial_address, 1024) # allocate 1024 bytes of memory
emulator.mem_write(initial_address, machine_code) # write the machine code
emulator.hook_add(uc.UC_HOOK_CODE, lambda uc, addr, size, _: print(f'Address: {addr}'))
emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)
This is what it outputs:
Keystone 0.9.1
Capstone 5.0.0
Unicorn 1.0.2
00f1010001f10201fff7fefff8e7
0000| 00f10100 add.w r0, r0, #1
0004| 01f10201 add.w r1, r1, #2
0008| fff7feff bl #8 ; why not `bl #0`?
000c| f8e7 b #0
Address: 0
Address: 4
Address: 8 # OK, we arrived at BL start
Address: 8 # we're at the same instruction again?
Address: 8 # and again?
Address: 8
< ... >
Address: 8
Address: 8
Traceback (most recent call last):
  File "run_ARM_bug.py", line 32, in <module>
    emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/unicorn-1.0.2rc3-py3.7.egg/unicorn/unicorn.py", line 317, in emu_start
unicorn.unicorn.UcError: Emulation timed out (UC_ERR_TIMEOUT)
The exception is not a problem (I set the timeout myself). The problem is that bl start always jumps to itself instead of start.
If I jump forward, however, everything works as expected. In this example, bl jumps to the correct address:
start:
; stuff
bl next
; hello
next:
add r0, r0, #1
bkpt
EDIT
I went on and assembled this code with Clang:
; test.s
.text
.syntax unified
.globl start
.p2align 1
.code 16
.thumb_func
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
Used the following commands:
$ clang -c test.s -target armv7-unknown-linux -o test.bin -mthumb
clang-11: warning: unknown platform, assuming -mfloat-abi=soft
And then disassembled test.bin with objdump:
$ objdump -d test.bin
test.bin: file format elf32-littlearm
Disassembly of section .text:
00000000 <start>:
0: 00 f1 01 00 add.w r0, r0, #1
4: 01 f1 02 01 add.w r1, r1, #2
8: ff f7 fe ff bl #-4
c: ff f7 fe bf b.w #-4 <start+0x10>
$
So bl's argument is actually an offset. It's negative because we're going backwards. BUT, as the documentation says:
For B, BL, CBNZ, and CBZ instructions, the value of the PC is the address of the current instruction plus 4 bytes.
So bl #-4 will jump to (the address of bl) + 4 bytes - 4 bytes, or, in other words, itself, again!
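The arithmetic above can be checked by decoding the four bytes by hand. The following is a minimal sketch that assumes the Thumb-2 BL encoding T1 bit layout from the ARM Architecture Reference Manual (S, imm10, J1, J2, imm11); decode_thumb_bl is a hypothetical helper, not part of Keystone, Capstone or Unicorn.

```python
def decode_thumb_bl(encoding: bytes, address: int):
    """Return (offset, target) for a 4-byte Thumb-2 BL instruction at `address`."""
    # Thumb-2 instructions are stored as two little-endian halfwords.
    hw1 = int.from_bytes(encoding[0:2], "little")
    hw2 = int.from_bytes(encoding[2:4], "little")
    # BL (T1): first halfword 11110 S imm10, second halfword 11 J1 1 J2 imm11.
    if hw1 >> 11 != 0b11110 or (hw2 & 0xD000) != 0xD000:
        raise ValueError("not a Thumb-2 BL encoding")
    s = (hw1 >> 10) & 1
    imm10 = hw1 & 0x3FF
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF
    i1 = 1 - (j1 ^ s)  # I1 = NOT(J1 XOR S)
    i2 = 1 - (j2 ^ s)  # I2 = NOT(J2 XOR S)
    offset = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:
        offset -= 1 << 25  # sign-extend the 25-bit immediate
    # The branch is relative to PC, which reads as the instruction address + 4.
    return offset, address + 4 + offset


offset, target = decode_thumb_bl(bytes.fromhex("fff7feff"), 8)
print(offset, target)  # -4 8
```

For the bytes at address 8 in the dump, this gives an offset of -4 and a target of 8, which matches both the objdump output and the trace printed by the hook.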
So, I can't bl backwards for some reason? What's happening here, and how do I fix it?
factorial function that uses the same bl instruction with Clang 3.7 (!) on armv7-apple-darwin (old jailbroken iPhone 4), and it worked, but the generated machine code was nothing like the one here. I rebuilt the same assembly for armv7-none-eabi (sadly, I don't have anywhere to run it) with like three different versions of Clang, and all of them generated the bl instruction that, according to objdump, was going to jump to itself. So the only "correct" version was generated by that old Clang for the iPhone. I have no clue why – ForceBru
f7ff fffa as machine code for bl start, but in my case with Keystone it's swapped like fff7 feff (should've been f7ff fffe, I guess, but that's still fe, not fa). Looks like an endianness issue. BTW, I think Keystone & friends actually use Clang's code to assemble and disassemble, so I'm thinking it's kinda pointless to compare it to "actual" Clang... Anyway, only that old Clang 3.7 generated something that definitely worked. I'm gonna try and summarize my research about the cursed bl instruction in Clang tomorrow and update the question. – ForceBru
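The two notations compared in the comments above (f7ff fffa versus fff7 feff) can be related with a small encoder. This is a sketch under the same assumption about the Thumb-2 BL encoding T1 bit layout from the ARM Architecture Reference Manual; encode_thumb_bl and as_halfwords are hypothetical helper names, not Keystone or Clang APIs.

```python
def encode_thumb_bl(address: int, target: int) -> bytes:
    """Encode a Thumb-2 BL at `address` that branches to `target`."""
    offset = target - (address + 4)  # PC reads as the instruction address + 4
    if offset % 2 or not -(1 << 24) <= offset < (1 << 24):
        raise ValueError("offset out of range for Thumb-2 BL")
    imm = offset & 0x1FFFFFF         # 25-bit two's complement
    s = (imm >> 24) & 1
    i1 = (imm >> 23) & 1
    i2 = (imm >> 22) & 1
    imm10 = (imm >> 12) & 0x3FF
    imm11 = (imm >> 1) & 0x7FF
    j1 = (1 - i1) ^ s                # J1 = NOT(I1) XOR S
    j2 = (1 - i2) ^ s                # J2 = NOT(I2) XOR S
    hw1 = 0xF000 | (s << 10) | imm10
    hw2 = 0xD000 | (j1 << 13) | (j2 << 11) | imm11
    # Each halfword is stored little-endian in memory.
    return hw1.to_bytes(2, "little") + hw2.to_bytes(2, "little")


def as_halfwords(encoding: bytes) -> str:
    """Format 4 bytes as two 16-bit halfword values."""
    hw1 = int.from_bytes(encoding[0:2], "little")
    hw2 = int.from_bytes(encoding[2:4], "little")
    return f"{hw1:04x} {hw2:04x}"


to_start = encode_thumb_bl(8, 0)  # bl at address 8 targeting address 0
to_self = encode_thumb_bl(8, 8)   # bl at address 8 targeting itself
print(to_start.hex(), "=", as_halfwords(to_start))  # fff7faff = f7ff fffa
print(to_self.hex(), "=", as_halfwords(to_self))    # fff7feff = f7ff fffe
```

Under these assumptions, fff7 and f7ff are the same halfword shown in byte order and as a 16-bit value, whereas fa versus fe is a real difference in the encoded offset (-12 versus -4).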