0
votes

I want to be able to run and debug a binary generated from pure assembly on an ARM Cortex-M4 microcontroller without having to use inline assembly inside a C program.

I have a linker script and some utility C startup code, which sets up the interrupt vector table, implements the Reset_Handler function, copies the .data section from flash to SRAM and then calls main(). This workflow works ok, but it's a bit clunky, and I would rather write assembly directly instead of inline in a C program that is nothing more than main() with the assembly mnemonics. I also want to know out of interest - maybe there is a better way altogether of going about this. The Reset_Handler function looks like this:

void Reset_Handler(void)
{
        //copy .data section to SRAM
        uint32_t size = (uint32_t)&_edata - (uint32_t)&_sdata;

        uint8_t *pDst = (uint8_t*)&_sdata; //sram
        uint8_t *pSrc = (uint8_t*)&_la_data; //flash

        for(uint32_t i =0 ; i < size ; i++)
        {
                *pDst++ = *pSrc++;
        }

        //Init. the .bss section to zero in SRAM
        size = (uint32_t)&_ebss - (uint32_t)&_sbss;
        pDst = (uint8_t*)&_sbss;
        for(uint32_t i =0 ; i < size ; i++)
        {
                *pDst++ = 0;
        }

        __libc_init_array();

        main();

}

EDIT: The details of my toolchain and linker script are included below.

  • Board: STM32F407VG with Cortex-M4.
  • OpenOCD and GDB for debugging
  • vim for code editor (my purpose is to work on baremetal without any IDE-provided startup or linker code).
  • arm-none-eabi-gcc for compiling and linking

I am trying to follow along with this tutorial, but instead of running the code in a VM, my goal is to run and debug directly on the board.

My linker script:


ENTRY(Reset_Handler)

MEMORY
{
  FLASH(rx):ORIGIN =0x08000000,LENGTH =1024K
  SRAM(rwx):ORIGIN =0x20000000,LENGTH =128K
}


SECTIONS
{
  .text :
  {
    *(.isr_vector)
    *(.text)
    *(.rodata)
    . = ALIGN(4);
    _etext = .;
  }> FLASH
  
  _la_data = LOADADDR(.data);
  
  .data :
  {
    _sdata = .;
    *(.data)
    *(.data.*)
    . = ALIGN(4);
    _edata = .;
  }> SRAM AT> FLASH
  
  .bss :
  {
    _sbss = .;
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss.*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
    __bss_end__ = _ebss;
       . = ALIGN(4); 
    end = .;
    __end__ = .;
  }> SRAM

  
}
1
"maybe there is a better way altogether" Yes: stackoverflow.com/a/47940277/584518. Most importantly, you aren't setting up the system clock before running this code, which is almost always a severe performance bug. - Lundin
Your question seems to be "how to write assembly", basically. There are many online guides that should at least get you started. The exact details will depend on your toolchain. - domen
You certainly don't set up the clocks or anything else in the C bootstrap; that happens later in the C code. Cart before the horse... Using C to bootstrap C is a huge problem though. - old_timer
Where is the linker script that goes with this code? Linker scripts and bootstraps have an intimate relationship. - old_timer
So you start off entirely in assembly language, but it is not clear: do you still want C code as well? - old_timer

1 Answer

2
votes

Your linker script

ENTRY(Reset_Handler)
MEMORY
{
  FLASH(rx):ORIGIN =0x08000000,LENGTH =1024K
  SRAM(rwx):ORIGIN =0x20000000,LENGTH =128K
}
SECTIONS
{
  .text :
  {
    *(.isr_vector)
    *(.text)
    *(.rodata)
    . = ALIGN(4);
    _etext = .;
  }> FLASH
  _la_data = LOADADDR(.data);
  .data :
  {
    _sdata = .;
    *(.data)
    *(.data.*)
    . = ALIGN(4);
    _edata = .;
  }> SRAM AT> FLASH
  .bss :
  {
    _sbss = .;
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss.*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
    __bss_end__ = _ebss;
       . = ALIGN(4);
    end = .;
    __end__ = .;
  }> SRAM
}

Since you have read the ARM and ST documents, you know that the vector table starts with the stack pointer load value, then the reset vector, then the other vectors; there can be hundreds, depending on the chip. The chip vendor maps the application flash at 0x08000000, and with certain boot options that is mirrored at 0x00000000, where the ARM core expects to find the vector table at boot. RAM starts at 0x20000000 and its size depends on the chip.
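To make that layout concrete, here is a minimal host-side sketch in C (not part of the firmware) that checks the first two words of a raw image such as so.bin. The helper name vector_table_looks_sane and the FLASH_/SRAM_ constants are made up for illustration; the ranges are assumed from the MEMORY block of the linker script in the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed memory map, copied from the MEMORY block of the linker script. */
#define FLASH_ORIGIN 0x08000000u
#define FLASH_LENGTH (1024u * 1024u)
#define SRAM_ORIGIN  0x20000000u
#define SRAM_LENGTH  (128u * 1024u)

/* Read a little-endian 32-bit word from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: word 0 of the image is the initial stack pointer,
 * word 1 is the reset vector. Returns true if both look plausible.
 */
bool vector_table_looks_sane(const uint8_t *image, uint32_t len)
{
    if (len < 8u)
        return false;

    uint32_t sp    = read_le32(image);
    uint32_t reset = read_le32(image + 4);

    /* Stack pointer: word aligned and inside SRAM (top of SRAM allowed). */
    if ((sp & 3u) != 0u || sp <= SRAM_ORIGIN || sp > SRAM_ORIGIN + SRAM_LENGTH)
        return false;

    /* Reset vector: bit 0 must be set (Thumb), target must be in flash. */
    if ((reset & 1u) == 0u)
        return false;
    uint32_t target = reset & ~1u;
    return target >= FLASH_ORIGIN && target < FLASH_ORIGIN + FLASH_LENGTH;
}
```

Fed the eight bytes 00 10 00 20 11 00 00 08 (0x20001000 and 0x08000011, little-endian), the check passes; with the low bit of the reset vector cleared it fails.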

.cpu cortex-m4

.word 0x20001000
.word Reset_Handler
.word loop
.word loop

.globl Reset_Handler
.thumb_func
Reset_Handler:
    b loop

.thumb_func
loop:
    b .

.align
.word 0x11223344
.word _edata
.word _sdata
.word _la_data
.word _ebss
.word _sbss
.word 0x55667788

That is not a bad starting point. As you know from reading up on the linker, it can generate symbols (variables, if you will) that you can then use in your code. The C code uses them, and they are just as available in assembly.

build it

arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m4 so.s -o so.o
arm-none-eabi-ld -nostdlib -nostartfiles -T so.ld so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -O binary so.elf so.bin
arm-none-eabi-objcopy -O srec --srec-forceS3 so.elf so.srec

examine the dump

Disassembly of section .text:

08000000 <Reset_Handler-0x10>:
 8000000:   20001000    andcs   r1, r0, r0
 8000004:   08000011    stmdaeq r0, {r0, r4}
 8000008:   08000013    stmdaeq r0, {r0, r1, r4}
 800000c:   08000013    stmdaeq r0, {r0, r1, r4}

08000010 <Reset_Handler>:
 8000010:   e7ff        b.n 8000012 <loop>

08000012 <loop>:
 8000012:   e7fe        b.n 8000012 <loop>
 8000014:   11223344            ; <UNDEFINED> instruction: 0x11223344
 8000018:   20000000    andcs   r0, r0, r0
 800001c:   20000000    andcs   r0, r0, r0
 8000020:   08000030    stmdaeq r0, {r4, r5}
 8000024:   20000000    andcs   r0, r0, r0
 8000028:   20000000    andcs   r0, r0, r0
 800002c:   55667788    strbpl  r7, [r6, #-1928]!   ; 0xfffff878

objdump -D disassembles everything, data included, so the mnemonics shown next to the vector table and the .word values are meaningless. Read the raw values instead:

08000000 <Reset_Handler-0x10>:
 8000000:   20001000   sp initialization value
 8000004:   08000011   reset handler address ORed with one (see the docs)
 8000008:   08000013   some other handler
 800000c:   08000013   some other handler


 8000014:   11223344   .word 0x11223344
 8000018:   20000000   .word _edata
 800001c:   20000000   .word _sdata
 8000020:   08000030   .word _la_data
 8000024:   20000000   .word _ebss
 8000028:   20000000   .word _sbss
 800002c:   55667788   .word 0x55667788

There is no .data yet, so _edata and _sdata are at the same place. _la_data is a kind of strange thing, and there is no .bss either, so its start and end are in the same place too. So add some:

.cpu cortex-m4

.word 0x20001000
.word Reset_Handler
.word loop
.word loop

.globl Reset_Handler
.thumb_func
Reset_Handler:
    b loop

.thumb_func
loop:
    b .

.align
.word 0x11223344
.word _edata
.word _sdata
.word _la_data
.word _ebss
.word _sbss
.word 0x55667788

.section .bss
.byte 0

.section .data
.byte 0x66


Disassembly of section .text:

08000000 <Reset_Handler-0x10>:
 8000000:   20001000    andcs   r1, r0, r0
 8000004:   08000011    stmdaeq r0, {r0, r4}
 8000008:   08000013    stmdaeq r0, {r0, r1, r4}
 800000c:   08000013    stmdaeq r0, {r0, r1, r4}

08000010 <Reset_Handler>:
 8000010:   e7ff        b.n 8000012 <loop>

08000012 <loop>:
 8000012:   e7fe        b.n 8000012 <loop>
 8000014:   11223344            ; <UNDEFINED> instruction: 0x11223344
 8000018:   20000004    andcs   r0, r0, r4
 800001c:   20000000    andcs   r0, r0, r0
 8000020:   08000030    stmdaeq r0, {r4, r5}
 8000024:   20000008    andcs   r0, r0, r8
 8000028:   20000004    andcs   r0, r0, r4
 800002c:   55667788    strbpl  r7, [r6, #-1928]!   ; 0xfffff878

Disassembly of section .data:

20000000 <_sdata>:
20000000:   00000066    andeq   r0, r0, r6, rrx

Disassembly of section .bss:

20000004 <__bss_start__>:
20000004:   00000000    andeq   r0, r0, r0

 8000018:   20000004    andcs   r0, r0, r4
 800001c:   20000000    andcs   r0, r0, r0
 8000020:   08000030    stmdaeq r0, {r4, r5}
 8000024:   20000008    andcs   r0, r0, r8
 8000028:   20000004    andcs   r0, r0, r4

so .data goes from 0x20000000 to 0x20000004(-1) and bss from 0x20000004 to 0x20000008(-1)

S00A0000736F2E7372656338
S315080000000010002011000008130000081300000863
S31508000010FFE7FEE744332211040000200000002019
S315080000203000000808000020040000208877665584
S309080000306600000058
S70508000011E1

and at address 0x08000030 we can see the .data value (0x66)
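If you would rather check that programmatically than by eye, a small host-side decoder is enough. This is only a sketch: parse_s3 and hex_digit are hypothetical helper names, it handles S3 records only (32-bit address), and it expects the line without a trailing newline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert one hex digit to its value, or return -1 if it is not hex. */
int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode one S3 record. On success the load address
 * goes to *addr, the data bytes to data[0..*n-1] (at most cap bytes),
 * and the function returns true.
 */
bool parse_s3(const char *line, uint32_t *addr,
              uint8_t *data, size_t cap, size_t *n)
{
    uint8_t bytes[256];
    size_t len = strlen(line);

    if (len < 4 || (len % 2) != 0 || line[0] != 'S' || line[1] != '3')
        return false;

    /* Everything after "S3" is hex: count, address, data, checksum. */
    size_t nbytes = (len - 2) / 2;
    if (nbytes < 6 || nbytes > sizeof bytes)
        return false;

    uint32_t sum = 0;
    for (size_t i = 0; i < nbytes; i++) {
        int hi = hex_digit(line[2 + 2 * i]);
        int lo = hex_digit(line[3 + 2 * i]);
        if (hi < 0 || lo < 0)
            return false;
        bytes[i] = (uint8_t)((hi << 4) | lo);
        sum += bytes[i];
    }

    /* The count byte covers address + data + checksum. */
    if (bytes[0] != nbytes - 1)
        return false;
    /* The checksum is the ones' complement of the sum of the other bytes. */
    if ((sum & 0xFFu) != 0xFFu)
        return false;

    size_t count = nbytes - 6;          /* minus count, 4 address, checksum */
    if (count > cap)
        return false;

    *addr = ((uint32_t)bytes[1] << 24) | ((uint32_t)bytes[2] << 16) |
            ((uint32_t)bytes[3] << 8)  |  (uint32_t)bytes[4];
    memcpy(data, bytes + 5, count);
    *n = count;
    return true;
}
```

For the record S309080000306600000058 this yields address 0x08000030 and the four data bytes 66 00 00 00, which is the .data byte padded out by the ALIGN(4) in the linker script.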

So you can simply rewrite the C code in assembly language (you did not need to do this analysis, but it is good to). If you do not put alignment into the linker script, then you have to do a byte-by-byte copy like the C code. If you are lucky and want to write the code for it, you can try something faster, but both ends need to be unaligned in the same way.
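The alignment condition is easier to see in C than in prose. The following is a rough sketch, runnable on a host: copy_section is a made-up name, and it takes the simple route of falling back to bytes whenever anything is unaligned rather than handling a matching unaligned head and tail.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy n bytes from src to dst.
 * Uses 32-bit words only when dst, src and n are all multiples of 4,
 * otherwise copies byte by byte. Returns 1 if the word path was taken,
 * 0 if the byte path was taken.
 */
int copy_section(void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (((d | s | (uintptr_t)n) & 3u) == 0u) {
        uint32_t *pd = dst;
        const uint32_t *ps = src;
        for (size_t i = 0; i < n / 4; i++)
            pd[i] = ps[i];
        return 1;
    }

    uint8_t *pd = dst;
    const uint8_t *ps = src;
    for (size_t i = 0; i < n; i++)
        pd[i] = ps[i];
    return 0;
}
```

With the ALIGN(4) statements already in the linker script, _sdata, _edata and _la_data should all land on word boundaries, so in a startup routine only the word path would be exercised.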

The things you need to do in your bootstrap for an mcu like this, minimum,

1) stack pointer
2) .data
3) .bss
4) call/branch to C entry point
5) infinite loop

Many folks will say you should never return from main() but

1) you can protect them anyway, and they will thank you later
2) they perhaps have not created a purely event driven solution

It does not hurt. As you read in the ARM documentation, the core has a mechanism for loading the stack pointer from the vector table; if you use that, the first box is checked.

Not intended to be lean and mean, wholly untested, maybe buggy:

.cpu cortex-m4
.syntax unified

.word 0x20001000
.word Reset_Handler
.word loop
.word loop

.globl Reset_Handler
.thumb_func
Reset_Handler:
    /*copy .data section to SRAM */
    /*uint32_t size = (uint32_t)&_edata - (uint32_t)&_sdata;*/
    ldr r0,=_edata
    ldr r1,=_sdata
    subs r0,r0,r1
    bne data_loop_done

    /*uint8_t *pDst = (uint8_t*)&_sdata; //sram*/
    /*uint8_t *pSrc = (uint8_t*)&_la_data; //flash*/

    ldr r2,=_la_data

    /*
    for(uint32_t i =0 ; i < size ; i++)
    {
            *pDst++ = *pSrc++;
    }
    */

data_loop:
    ldrb r3,[r2]
    adds r2,#1
    strb r3,[r1]
    adds r1,#1
    subs r0,r0,#1
    bne data_loop
data_loop_done:

    /*
    Init. the .bss section to zero in SRAM
    size = (uint32_t)&_ebss - (uint32_t)&_sbss;
    pDst = (uint8_t*)&_sbss;
    for(uint32_t i =0 ; i < size ; i++)
    {
            *pDst++ = 0;
    }
    */

    ldr r0,=_ebss
    ldr r1,=_sbss
    mov r2,#0
    subs r0,r0,r1
    bne bss_loop_done
bss_loop:
    strb r2,[r1]
    adds r1,#1
    bne bss_loop
bss_loop_done:

    /*__libc_init_array();*/
    bl __libc_init_array

    /*main();*/
    bl main

    b loop

.thumb_func
loop:
    b .

__libc_init_array:
    bx lr

main:
    bx lr

.align
.word 0x11223344
.word _edata
.word _sdata
.word _la_data
.word _ebss
.word _sbss
.word 0x55667788

.section .bss
.byte 0

.section .data
.byte 0x66

It does build; here is the start of the listing:

08000010 <Reset_Handler>:
 8000010:   4814        ldr r0, [pc, #80]   ; (8000064 <main+0x1e>)
 8000012:   4915        ldr r1, [pc, #84]   ; (8000068 <main+0x22>)
 8000014:   1a40        subs    r0, r0, r1
 8000016:   d106        bne.n   8000026 <data_loop_done>
 8000018:   4a14        ldr r2, [pc, #80]   ; (800006c <main+0x26>)

0800001a <data_loop>:
 800001a:   7813        ldrb    r3, [r2, #0]
 800001c:   3201        adds    r2, #1
 800001e:   700b        strb    r3, [r1, #0]
 8000020:   3101        adds    r1, #1
 8000022:   3801        subs    r0, #1
 8000024:   d1f9        bne.n   800001a <data_loop>

08000026 <data_loop_done>:
...
 8000064:   20000004    andcs   r0, r0, r4
 8000068:   20000000    andcs   r0, r0, r0
 800006c:   08000078    stmdaeq r0, {r3, r4, r5, r6}

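If you are careful you can do it without forcing thumb2 instructions where they are not necessary. You may be able to improve this with thumb2 instructions, but if the linker script does its job (aligns the section boundaries), you can use ldr/str and copy a word at a time, possibly comparing against the end address rather than counting down a size. Whichever...

Hmm, yeah, I did leave an instruction out of the above code: the byte-wise bss_loop never decrements r0, so it needs a subs r0,r0,#1 before its bne. The two branches that skip a loop when the size is zero (bne data_loop_done and bne bss_loop_done) should also be beq. A word-at-a-time version that compares against the end address: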

    ldr r0,=_ebss
    ldr r1,=_sbss
    mov r2,#0
    cmp r0,r1
    beq bss_loop_done
bss_loop:
    str r2,[r1]
    adds r1,#4
    cmp r0,r1
    bne bss_loop
bss_loop_done:

should be four or more times faster depending on the system (chip). BUT you have to ensure that the start and end addresses are word aligned. You can go further than that by increasing the alignment to a double-word boundary:

    ldr r0,=_ebss
    ldr r1,=_sbss
    mov r2,#0
    mov r3,#0
    cmp r0,r1
    beq bss_loop_done
bss_loop:
    stm r1!,{r2,r3}
    cmp r0,r1
    bne bss_loop
bss_loop_done:

Could have used the stm in the word at a time loop and saved an instruction. You might see a gain with 4 words at a time but might not on a cortex-m, getting up to 2 words is a nice balance. And you can do the same optimizations with the .data copy.
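For the .data copy, the two-words-per-iteration idea looks roughly like this in C. It is a host-runnable sketch with a hypothetical name (copy_2w), included mainly to show why the region has to be a multiple of 8 bytes when the loop ends on an equality comparison with the end pointer.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring an ldm/stm pair that moves two registers:
 * copy [dst, dst_end) from src, two words per iteration.
 * Returns 0 on success, -1 if the region is reversed or not a multiple
 * of 8 bytes. Without that check the != test would step past dst_end
 * and the loop would never terminate.
 */
int copy_2w(uint32_t *dst, uint32_t *dst_end, const uint32_t *src)
{
    uintptr_t a = (uintptr_t)dst;
    uintptr_t b = (uintptr_t)dst_end;

    if (b < a || ((b - a) % 8u) != 0u)
        return -1;

    while (dst != dst_end) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst += 2;
        src += 2;
    }
    return 0;
}
```

In the linker script that would mean using ALIGN(8) instead of ALIGN(4) around .data (and around .bss for the zero fill), which is an assumption about how you would change your script, not something the script above already does.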

I hope this was not a homework assignment; if it was, you still get to find and debug it. But it is a simple matter of reading and porting the code, and of looking at the endless supply of examples out there.

Looking at the linker script now, it was designed for the vector table to live in its own .isr_vector section:

.cpu cortex-m4
.syntax unified

.section .isr_vector

.word 0x20001000
.word Reset_Handler
.word loop
.word loop

.section .text

.globl Reset_Handler
.thumb_func
Reset_Handler:
    b loop

.thumb_func
loop:
    b .

Disassembly of section .text:

08000000 <Reset_Handler-0x10>:
 8000000:   20001000    andcs   r1, r0, r0
 8000004:   08000011    stmdaeq r0, {r0, r4}
 8000008:   08000013    stmdaeq r0, {r0, r1, r4}
 800000c:   08000013    stmdaeq r0, {r0, r1, r4}

08000010 <Reset_Handler>:
 8000010:   e7ff        b.n 8000012 <loop>

08000012 <loop>:
 8000012:   e7fe        b.n 8000012 <loop>

That way you do not have to list the objects on the command line in a certain order.

There is an intimate relationship between the linker script and the bootstrap code; you can't really have one without the other, they are a pair. You cannot, or should not, attempt to mix and match linker scripts and bootstrap code from various projects willy-nilly; keep them together as designed.

Linker scripts are not portable and assembly language is not assumed to be portable, so IMO you should make each as simple and lean and mean as possible: less is more, less to port, less to maintain, less toolchain-specific stuff. That is not the general view of developers; they love to make grossly over-complicated linker scripts. The C library can play a role here too. With the gnu model the C library is really a separate part and you can insert whichever one you want (and it comes with its related bootstrap and linker script), but that depends on how that library works, the target, etc.

A microcontroller without an RTOS is not really C library friendly so you have to ask yourself do I really need a C library, how much simpler and smaller (and cheaper) and readable and more maintainable can I make this project?

Mine tend to look like this

.thumb_func
reset:
    bl main
    b .

MEMORY
{
    rom : ORIGIN = 0x08000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text   : { *(.text*)   } > rom
    .rodata : { *(.rodata*) } > rom
    .bss    : { *(.bss*)    } > ram
}

For each one of us reading this with this experience you are going to see a different style, different opinion, etc. That is another feature of bare-metal, the freedom to do it your own way, only truly bound by the hardware rules, nothing else. No-one's solution is really wrong, it just reflects their style.