
I'm trying to write the smallest possible code to extract the firmware of Infineon's XMC4500 microcontroller.

The code must fit into a 30-byte buffer, which allows me 15 machine instructions using the 16-bit Thumb instruction set.

Starting with C, my attempt is to dump flash memory through a single GPIO pin (see original question) following this nifty trick.

Basically what I'm doing is:

  1. Set up the GPIO pin directions as outputs
  2. Blink LED1 (pin 1.1) with a clock (SPI serial clock)
  3. Blink LED2 (pin 1.0) with data bits (SPI MOSI)
  4. Sniff pins with a logic analyzer
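On the logic-analyzer side (step 4), the capture can be turned back into words by sampling the data line on each rising clock edge. The following is a host-side sketch under stated assumptions, not a finished tool: `decode_words`, `CLK_MASK` and `DATA_MASK` are hypothetical names, each capture sample is assumed to be one byte with P1.1 (clock) in bit 1 and P1.0 (data) in bit 0, the analyzer is assumed to sample fast enough to see every clock-high phase, and bits are assumed to leave the chip LSB first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capture format: one byte per sample,
   bit 1 = clock (P1.1), bit 0 = data (P1.0). */
#define CLK_MASK  0x2u
#define DATA_MASK 0x1u

/* Rebuild 32-bit words from a capture. The data line is read on every
   rising clock edge and shifted in LSB first. Returns the number of
   complete words written to `words`. */
size_t decode_words(const uint8_t *samples, size_t n,
                    uint32_t *words, size_t max_words)
{
    size_t count = 0;
    uint32_t acc = 0;      /* word being assembled */
    unsigned nbits = 0;    /* bits collected so far */
    unsigned prev_clk = 0;

    for (size_t k = 0; k < n && count < max_words; k++) {
        unsigned clk = (samples[k] & CLK_MASK) != 0;
        if (clk && !prev_clk) {                 /* rising clock edge */
            uint32_t bit = samples[k] & DATA_MASK;
            acc |= bit << nbits;
            if (++nbits == 32) {                /* word complete */
                words[count++] = acc;
                acc = 0;
                nbits = 0;
            }
        }
        prev_clk = clk;
    }
    return count;
}
```

Because the firmware drives data and the clock-high level with a single store, the data bit is already valid in the first sample where the clock reads high.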

EDIT:

  1. Updated C code block
  2. Added assembly code block

#include "XMC4500.h"

void main() {
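  // start dumping at memory address 0x00000000
  // (address 0 is the null pointer in C; with GCC, build with
  //  -fno-delete-null-pointer-checks so this load is not treated as undefined)
  uint32_t *p = (uint32_t *)0x0u;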

  // configure port1 output (push-pull)
  PORT1->IOCR0 = 0x8080u;

  for(;;) {
    int i = 32;

    int data = *(p++);

    do {
      // clock low
      PORT1->OUT = 0x0;

      // clock high with data bits
      PORT1->OUT = 0x2u | data;

      data >>= 1;

    } while (--i > 0);
  }
}
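One way to check the loop's bit order and iteration count off-target is to run the same body on a host with the port store redirected into an array. In this sketch `simulate_word` and `trace` are hypothetical names, and `data` is made `uint32_t`; bit 0, the only bit that drives P1.0 here, follows the same sequence under arithmetic or logical shift.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model of the inner loop: each store that would go to
   PORT1->OUT is appended to `trace` (must hold 64 entries) so the
   waveform can be inspected. Returns the number of port writes. */
size_t simulate_word(uint32_t word, uint32_t *trace)
{
    size_t n = 0;
    int i = 32;
    uint32_t data = word;
    do {
        trace[n++] = 0x0u;          /* clock low */
        trace[n++] = 0x2u | data;   /* clock high, data bit on bit 0 */
        data >>= 1;
    } while (--i > 0);              /* body runs 32 times, not 31 */
    return n;
}
```

With `word = 0x80000003`, the trace holds 64 stores, and the low two bits of the clock-high stores begin 3, 3, 2 and end with 3: bit 0 walks through the word from bit 0 to bit 31.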
main:
    ; PORT1->IOCR0 = 0x8080UL
    ldr r1, =0x48028100 ; load port1 base address to R1
    movw r2, #0x8080 ; move 0x8080 to R2
    str r2, [r1, #0x10]

main_1:
    ; start copying at address 0x00000000
    ; R12 is known to be zeroed
    ldr.w r2, [r12], #0x4 ; int data = *(p++)
    movs r3, #32 ; int i = 32

main_2:
    ; PORT1->OUT = 0x0
    ; clock low
    ; R12 is known to be zeroed
    str r12, [r1]

    ; PORT1->OUT = 0x2 | data
    ; clock high with data bits
    orr r4, r2, #0x2
    str r4, [r1]

    asrs r2, r2, #0x1 ; data >>= 1

    subs r3, r3, #0x1 ; i--
    bne.n main_2 ; while (--i > 0)
    b.n main_1 ; while(true)

However, the code is still too big to meet my requirements.

Is there anything I can do to shrink my code further? Anything that can be optimized or left out?
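Counting encodings shows where the bytes go. The table below is a hedged tally of the assembly listing above, assuming the standard Thumb-2 encoding rules on the Cortex-M4 and ignoring any alignment padding the assembler may insert before the literal pool; `listing`, `code_bytes` and `total_bytes` are hypothetical names.

```c
#include <stddef.h>

/* Encoded size of each instruction in the listing, assuming Thumb-2 rules:
   - movw, ldr.w with post-index, and orr with an immediate are 32-bit only;
   - the 16-bit STR encoding needs a low register (r0-r7) as the source,
     so `str r12, [r1]` needs the 32-bit form;
   - the remaining instructions have 16-bit forms. */
struct insn { const char *text; unsigned bytes; };

static const struct insn listing[] = {
    { "ldr r1, =0x48028100",   2 },  /* pc-relative load of a literal */
    { "movw r2, #0x8080",      4 },
    { "str r2, [r1, #0x10]",   2 },
    { "ldr.w r2, [r12], #0x4", 4 },
    { "movs r3, #32",          2 },
    { "str r12, [r1]",         4 },
    { "orr r4, r2, #0x2",      4 },
    { "str r4, [r1]",          2 },
    { "asrs r2, r2, #0x1",     2 },
    { "subs r3, r3, #0x1",     2 },
    { "bne.n main_2",          2 },
    { "b.n main_1",            2 },
};

#define LITERAL_POOL_BYTES 4u   /* the 32-bit constant 0x48028100 */

/* Sum of the instruction encodings only. */
unsigned code_bytes(void)
{
    unsigned total = 0;
    for (size_t k = 0; k < sizeof listing / sizeof listing[0]; k++)
        total += listing[k].bytes;
    return total;
}

/* Instructions plus the literal the first ldr loads from. */
unsigned total_bytes(void)
{
    return code_bytes() + LITERAL_POOL_BYTES;
}
```

Under these assumptions the four 32-bit instructions cost 8 extra bytes: 32 bytes of code plus a 4-byte literal, against a 30-byte budget.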

For such tight requirements you will have to do it in assembly. The compiler will not optimize purely for space (it weighs speed against space), and it will not use quirks, as they usually break with updates. Take your program, compile it, and read/decode the assembly output. – Paul Sullivan
Can you load the buffer and then reload the buffer? I.e., can you have different code sequences to initialize registers to known values and then run the code that actually performs the data transfer? It is unlikely otherwise, as much of the code will be loading constants. – artless noise
@artlessnoise Unfortunately, reloading the buffer is not an option. – user3696425
Have you tested your changes? Either r12 is always 0, so you can use it to clear the clock, or it is advancing with the ldr.w, in which case you can use it as p. Not both. And the latest change also writes "random" data to the high pins of Port1. Are you sure that won't cause problems? – AShelly
@AShelly My changes are working fine. 1.: r12 is indeed advancing with the ldr.w. However, ldr.w increments r12 by 4, which always gives me a multiple of 4, so bit 1 (and bit 0) will be zero at all times. In that case I can use r12 both as p and for clearing the clock. 2.: For toggling the two pins I only need the first two bits, so I guess there's no need to worry about the high pins, right? – user3696425
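The reasoning in the last comment can be checked directly: a pointer that starts at 0 and only ever advances by 4 stays a multiple of 4, so the two bits driving P1.0 and P1.1 are 0 whenever it is stored to `PORT1->OUT`. Bits 2 and up do change, which is the high-pin side effect AShelly raised. `low_bits_stay_clear` is a hypothetical helper for illustration only.

```c
#include <stdint.h>

/* Models using r12 (the post-incremented read pointer) as the
   "clock low" value. Returns 1 if bits 0 and 1 (data and clock)
   are clear for the first `words` pointer values starting at `start`. */
int low_bits_stay_clear(uint32_t start, uint32_t words)
{
    uint32_t p = start;
    for (uint32_t k = 0; k < words; k++) {
        if ((p & 0x3u) != 0)
            return 0;
        p += 4;   /* effect of ldr.w r2, [r12], #0x4 */
    }
    return 1;
}
```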

2 Answers


If the high bits of Port1 don't change during this process, and you can ensure that you read the data bit slightly after the clock goes high, you could try something like this:

#define P1_DEFAULT 0x0u  // placeholder (0 as in the question): constant high bits of port 1, zeros in low two bits
int *dp = 0;             // maybe use a register which is known to be zeroed
PORT1->IOCR0 = 0x8080;   // should be 3 instructions
for (;;) {
  int i = 32;
  int data = *(dp++);    // an LDMIA instruction may do the load and increment in one step
  do {
    PORT1->OUT = P1_DEFAULT;                   // clock low
    PORT1->OUT = P1_DEFAULT + 2 + (data & 1);  // clock high with data
    data >>= 1;
  } while (--i > 0);
}

This should remove three port reads, one port write and a conditional.
Do it all in one function to avoid any call overhead. I would start with the generated assembly for this and see what you can do to improve it.


Sixteen instructions isn't a whole lot; I would not expect that a C compiler could produce code efficient enough to bit-bang a memory dump. If you aren't too picky about the output bit patterns, I think 32 bytes will suffice using something like:

    ldr  r1,=0x48028100 ; Port1: address of IO port
    movs r3,#1
    str  r3,[r1,#0x10]  ; IOCR0
    lsls r0,r3,#27      ; r0 = 0x08000000
bytes:
    movs r5,#9
    strb r5,[r1]        ; OUT: start pulse (bit 0 of 9 is high)
    adds r0,#1
    ldrb r4,[r0]
bits:
    strb r4,[r1]        ; OUT
    lsrs r4,r4,#1
    subs r5,#1
    bne  bits
    b    bytes

Each byte will be output as a high pulse, followed by eight bit times that may be high or low (depending upon data read), followed by a bit time which is always zero to ensure the rising edge of the next high pulse will be visible. Basically similar to asynchronous serial communication, but with the levels reversed.
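A receiver for this framing can work like a UART decoder. The sketch below is idealized: it assumes exactly one sample of P1.0 per bit time, which the listing does not strictly guarantee (the start pulse and the data bits come from different instruction sequences, so their durations need not match), so a real decoder would time each byte from its rising edge instead. `encode_byte` and `decode_bytes` are hypothetical names; `encode_byte` mirrors the listing, where the tenth level is bit 8 of a byte loaded by `ldrb` and therefore always 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Levels on P1.0 for one byte: a high start pulse, eight data bits
   LSB first, then the always-zero ninth bit. Writes 10 levels. */
size_t encode_byte(uint8_t byte, uint8_t *levels)
{
    levels[0] = 1;                    /* strb r5 with r5 = 9: bit 0 set */
    uint32_t r4 = byte;               /* ldrb zero-extends, so bit 8 is 0 */
    for (int b = 1; b <= 9; b++) {    /* the bits loop runs 9 times */
        levels[b] = (uint8_t)(r4 & 1u);
        r4 >>= 1;
    }
    return 10;
}

/* Decode one-sample-per-bit levels back into bytes. Stops early on a
   framing error (ninth bit not zero). Returns the number of bytes. */
size_t decode_bytes(const uint8_t *levels, size_t n,
                    uint8_t *out, size_t max_out)
{
    size_t k = 0, count = 0;
    while (k + 10 <= n && count < max_out) {
        if (levels[k] == 0) {         /* idle low: wait for start pulse */
            k++;
            continue;
        }
        uint8_t byte = 0;
        for (int b = 0; b < 8; b++)   /* data bits, LSB first */
            byte |= (uint8_t)((levels[k + 1 + b] & 1u) << b);
        if (levels[k + 9] != 0)       /* framing error */
            break;
        out[count++] = byte;
        k += 10;
    }
    return count;
}
```

Because every byte ends with a guaranteed low bit time, the next start pulse always produces a rising edge, which is what lets the decoder resynchronize on each byte.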