Should a semiconductor manufacturer buying IPs from ARM meet the clock cycles for an instruction described in the reference manual?

Question

For the CC3220S manufactured by Texas Instruments, I developed a C function that uses inline assembly to wait 1 second (not counting the instructions before and after the loop). According to the ARMv7-M reference manual, a MOV instruction that targets the PC takes 1 + P cycles, where P is between 1 and 3 depending on the pipeline refill. In the worst case, this means that the loop executes in 6 clock cycles.

The CC3220S runs at a clock speed of 80 MHz. However, executing the loop 10 million times creates the desired delay of 1 second (verified with a logic analyzer). This means that the loop uses 8 clock cycles per iteration, not 6. I have my doubts about the number of clock cycles the instruction uses. Hence my question: should a semiconductor manufacturer buying IPs from ARM meet the clock cycles for an instruction described in the reference manual?

void delay_1sec(void)
{
    __asm("    PUSH {r4-r5,lr}");  

    __asm("    LDR r4, [pc, #12]"); 

    __asm("    MOV r5, pc");        
    __asm("    NOP");               

    __asm("    SUBS r4, #1");   /* 1 instruction cycle */ 
    __asm("    ITE NEQ");       /* 1 instruction cycle */ 

    __asm("    MOV pc, r5");    /* 1 + P instruction cycles (where P is between 1 and 3 depending on pipeline refill) */


    __asm("    POP {r4-r5,pc}"); 
    __asm("    .word    10000000"); 
}

IDK whether to expect all ARMv7-M CPUs to have the same performance; seems unlikely. But separate from that: If you're going to write the whole body of a function in inline asm, including a return instruction (pop into PC), make it __attribute__((naked)) so it can't inline into other functions and break them. Also, prefer one large asm() statement, although inside a naked function, this is safe. But really this is total overkill; just ask the compiler for 10000000 in a "+r" (var) register and another "=r" dummy output in a GNU C Extended asm statement. — Peter Cordes
Gah, don't use inline assembly like this. Just use a separate assembly source file, so you don't have to have all this __asm("..."); nonsense and don't have to worry about the compiler inserting whatever instructions it wants. — Ross Ridge
Thanks for the suggestions, but this is going too far off-topic from the original question. This is not production code but more of a proof of concept. — Xhendos
@Xhendos It is hard to rule out that the compiler inserts instructions on its own the way you wrote your code. — fuz
ARMv7-M is an architecture, I cannot find any cycle counting in it. The page you linked is about Cortex-M4 (which implements ARMv7-M and is sold as an IP). I'd say that all unadulterated ARM Cortex-M cores based on ARMv7-M will share the same core implementation and thus have the same cycle counting. However, that doesn't have to be true always and everywhere. You also need an 80 MHz clock for your code to work (which further restricts the set of microprocessors). ARMv7-M encourages single or low count cycle instructions, so the variation among totally different impls shouldn't be a lot, yet still noticeable. — Margaret Bloom

artless noise · Accepted Answer · 2019-12-07T19:34:07

From your reference,

The cycle counts are based on a system with zero wait states.

From your source the loop is,

SUBS r4, #1   /* 1 cycle */ 
ITE NEQ       /* 1 cycle */ 
MOV pc, r5    /* 4 cycles */

Assuming the compiler inserts no additional code, your memory may add two wait states when refilling the instruction pipeline. Also, a vendor may modify the core and does not need to meet this timing. Some vendors license the 'architecture' and design their own logic to implement the instruction set. Others buy a logic block that implements the Cortex-M4. I would guess TI is the latter and that memory wait states are your issue. You didn't note which memory device your code is located in. If your system uses the 'serial flash', two additional wait states would not be surprising at all. This would bring the cycle count to 8, which is what you observe.
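The arithmetic behind that estimate can be written out as a rough sketch. The helper names below are hypothetical, and the model is a simplification that follows the accounting used here: it assumes the wait states are paid once per iteration, on the pipeline refill after the branch, which is a guess about the memory system rather than a documented figure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles per loop iteration implied by a measured delay.
 * Integer arithmetic only, so the result is truncated. */
static uint32_t observed_cycles_per_iteration(uint32_t clock_hz,
                                              uint32_t delay_ms,
                                              uint32_t iterations)
{
    assert(iterations > 0);
    uint64_t total_cycles = (uint64_t)clock_hz * delay_ms / 1000u;
    return (uint32_t)(total_cycles / iterations);
}

/* Hypothetical helper: expected cycles per iteration for the loop
 *   SUBS (1) + ITE (1) + MOV pc (1 + P) + wait states on the refill.
 * ASSUMPTION: wait states are added once per iteration. */
static uint32_t expected_cycles_per_iteration(uint32_t refill_p,
                                              uint32_t wait_states)
{
    assert(refill_p >= 1 && refill_p <= 3);
    return 1u + 1u + (1u + refill_p) + wait_states;
}
```

With an 80 MHz clock, a 1000 ms delay and 10,000,000 iterations, the observed figure is 8 cycles. The expected figure with P = 3 is 6 cycles at zero wait states and 8 cycles with two wait states, which matches the measurement.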

Hence my question, should a semiconductor manufacturer buying IPs from ARM meet the clock cycles for an instruction described in the reference manual?

From the above, the answer is NO. If they are an architecture licensee, the cycle counts may be different. They need to be binary compatible (but even this is not always the case). However, in your case, I believe they are meeting the document; it just needs to be fully applied to the use case by accounting for the memory wait states. The on-board SRAM could also have wait states. Typically only TCM is zero wait state.
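As a practical consequence, a delay loop is probably best calibrated from a measured cycle count for the memory the code actually runs from, not from the zero-wait-state figure. A minimal sketch, with hypothetical helper names and the numbers from the question used only as an illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: loop count for a target delay, given the cycles per
 * iteration measured on the real system. Rounds to the nearest integer. */
static uint32_t iterations_for_delay(uint32_t clock_hz, uint32_t delay_ms,
                                     uint32_t cycles_per_iteration)
{
    assert(cycles_per_iteration > 0);
    uint64_t total_cycles = (uint64_t)clock_hz * delay_ms / 1000u;
    return (uint32_t)((total_cycles + cycles_per_iteration / 2u)
                      / cycles_per_iteration);
}

/* Hypothetical helper: delay in milliseconds (truncated) that a given loop
 * count produces when each iteration really costs actual_cycles. */
static uint32_t resulting_delay_ms(uint32_t clock_hz, uint32_t iterations,
                                   uint32_t actual_cycles)
{
    assert(clock_hz > 0);
    uint64_t total_cycles = (uint64_t)iterations * actual_cycles;
    return (uint32_t)(total_cycles * 1000u / clock_hz);
}
```

At 80 MHz, a 1 second delay needs 10,000,000 iterations at the measured 8 cycles. Sizing the loop from the 6-cycle figure would give 13,333,333 iterations, which at 8 cycles each would run for roughly 1333 ms instead of 1000 ms.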
