4 votes

I'm working on an embedded system using ARM7TDMI processor.

In a time critical ISR, I need to take a snapshot (copy) 24 16-bit values from hardware registers into SRAM. The values are consecutive and can be treated as an array.

The data bus (to SRAM and the hardware registers) is 16-bit, and we are running in ARM mode (8/32).

At the shop we are discussing the optimal method for copying the data: as 16-bit quantities or as 32-bit quantities.

My argument is that since the ARM is running in 32-bit mode, a single 32-bit instruction performing two 16-bit fetches will be faster than two 16-bit instructions each performing one fetch.
Also, there are half as many instructions to fetch, which should roughly halve the copy time.

Does anybody have any data to support either method? (My oscilloscopes are all allocated, so I can't make measurements on the embedded system, and I can't run a huge number of iterations because an ISR fires every millisecond.) *(Profiling is difficult because our JTAG Jet probes don't provide the means for accurate profiling.)*
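If a free-running hardware timer or cycle counter is available, one low-cost approach is to bracket the copy with counter reads inside the ISR and inspect the delta afterwards with the debugger. The sketch below assumes a hypothetical memory-mapped counter, TIMER_COUNT, at a placeholder address:

#include <stdint.h>

/* Hypothetical free-running up-counter; the name and address are placeholders. */
#define TIMER_COUNT (*(volatile uint32_t *)0xFFFF0010U)

static volatile uint32_t copy_ticks;   /* inspected later via the debugger */

void snapshot_isr(void)
{
    uint32_t start = TIMER_COUNT;

    /* ... the 16-bit or 32-bit copy under test goes here ... */

    copy_ticks = TIMER_COUNT - start;  /* unsigned subtraction handles counter wrap */
}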

Sample code, 16-bit copy:

#include <stdint.h>

#define MAX_16_BIT_VALUES 24U

/* p_hardware is initialized elsewhere to point at the FPGA register block. */
uint16_t volatile * p_hardware;
uint16_t data_from_hardware[MAX_16_BIT_VALUES];

/* Inside the ISR: */
data_from_hardware[0] = p_hardware[0];
data_from_hardware[1] = p_hardware[1];
data_from_hardware[2] = p_hardware[2];
data_from_hardware[3] = p_hardware[3];
//...
data_from_hardware[20] = p_hardware[20];
data_from_hardware[21] = p_hardware[21];
data_from_hardware[22] = p_hardware[22];
data_from_hardware[23] = p_hardware[23];

Sample code, 32-bit copy:

/* Note: this requires data_from_hardware to be 32-bit aligned and the FPGA
   registers to tolerate 32-bit (paired) accesses. */
uint32_t * p_data_from_hardware = (uint32_t *)&data_from_hardware[0];
uint32_t volatile * p_hardware_32_ptr = (uint32_t volatile *)p_hardware;
p_data_from_hardware[0] = p_hardware_32_ptr[0];
p_data_from_hardware[1] = p_hardware_32_ptr[1];
p_data_from_hardware[2] = p_hardware_32_ptr[2];
p_data_from_hardware[3] = p_hardware_32_ptr[3];
//...
p_data_from_hardware[ 8] = p_hardware_32_ptr[ 8];
p_data_from_hardware[ 9] = p_hardware_32_ptr[ 9];
p_data_from_hardware[10] = p_hardware_32_ptr[10];
p_data_from_hardware[11] = p_hardware_32_ptr[11];

Details: ARM7TDMI processor running in 8/32-bit mode, IAR EW compiler.

Note: The code is unrolled to avoid loop and instruction-fetch overhead.
Note: The assembly-language listing shows that accessing memory through constant indices is more efficient than through an incremented pointer; a sketch of the pointer-increment variant we compared against is shown below.
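For reference, the pointer-increment form that produced the less efficient listing looks roughly like this (a sketch only; the real code differs in naming and unrolling):

/* Pointer-increment variant (sketch): each iteration updates two pointers and
   tests a loop counter, which the unrolled, constant-index version avoids. */
uint16_t volatile * src = p_hardware;
uint16_t          * dst = &data_from_hardware[0];
unsigned int        i;
for (i = 0U; i < MAX_16_BIT_VALUES; ++i)
{
    *dst++ = *src++;
}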

Edit 1: Testing

As per Chris Stratton's comment, we are experiencing issues when making 32-bit fetches on our 16-bit FPGA registers, so the 32-bit optimization is not possible.

That said, I profiled using the DMA controller. The performance gain from using the DMA was 30 us (microseconds). On our project we are hoping for more substantial time savings, so this optimization is not worthwhile. The experiment did show that DMA would be very useful if we had more data to transfer, or if the transfer could run in parallel.

An interesting note is that the DMA required 17 instructions to set up.
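For context, the setup was conceptually along the following lines. This is only a sketch: the register names (DMA_SRC, DMA_DST, DMA_COUNT, DMA_CTRL), their addresses, and the control value are placeholders, not our controller's actual register map.

/* Hypothetical DMA setup sketch -- all register names, addresses, and bit
   values below are placeholders, not the real controller's register map. */
#define DMA_SRC   (*(volatile uint32_t *)0xFFFF1000U)  /* source address      */
#define DMA_DST   (*(volatile uint32_t *)0xFFFF1004U)  /* destination address */
#define DMA_COUNT (*(volatile uint32_t *)0xFFFF1008U)  /* transfer count      */
#define DMA_CTRL  (*(volatile uint32_t *)0xFFFF100CU)  /* control / start     */

DMA_SRC   = (uint32_t)p_hardware;                /* FPGA register block     */
DMA_DST   = (uint32_t)&data_from_hardware[0];    /* SRAM destination buffer */
DMA_COUNT = MAX_16_BIT_VALUES;                   /* 24 half-word transfers  */
DMA_CTRL  = 1U;                                  /* placeholder "start" value */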

Turn off your interrupts and measure the performance in your main code. You should probably consider using DMA instead, though... that is fastest of all. – Mark Lakata
Do you know for a fact that the registers are accessible in both modes? On some of the ARM parts I work with, there's a requirement for a specific access width for some of the SFRs. Also, you may want to consider at what size of block transfer a DMA-based memcpy implementation becomes worthwhile (probably larger than this, but?) – Chris Stratton
@MarkLakata - it seems like the code in question is in the interrupt handler, not the main program. It should be possible to write test code that just does this a bunch of times unnecessarily in the main loop, but there can be reasons why that would not give the same result. Perhaps there's a hardware cycle counter which can be accessed for sparse-trials benchmarking? – Chris Stratton
@ChrisStratton: The registers are in a memory-mapped FPGA, and a 32-bit fetch will read 2 consecutive registers. Also, 48 bytes is worthwhile to copy with the DMA controller (setup takes at least 6 instructions). – Thomas Matthews
Chris has a point. Some hardware requires reads of a specific width. However, I can't believe the code runs any differently inside the ISR than in the main loop. Code is code. If the hardware is blocking the reads, then there is no point in trying to optimize it -- much better off using DMA instead. – Mark Lakata

1 Answer

1 vote

If speed is of utmost importance, your best bet, if the hardware can support it, will be an assembly-language routine something like this:

; Assume R0 holds the source base and R1 holds the destination base
PUSH    {R4-R7}
LDMIA   R0!,{R2-R7}     ; load  6 words (24 bytes), advancing R0
STMIA   R1!,{R2-R7}     ; store 6 words, advancing R1
LDMIA   R0!,{R2-R7}     ; load  the remaining 6 words
STMIA   R1!,{R2-R7}     ; store the remaining 6 words
POP     {R4-R7}

I believe on the ARM7TDMI, when using a 32-bit bus, LDR takes three cycles and STR takes two; loading or storing n words with LDMIA/STMIA requires 3+n cycles. Thus, 12 LDRs and 12 STRs would require 60 cycles, but the sequence above should require 50 (including the register save/restore). I would expect that using a 16-bit bus would add an extra cycle of penalty to every 32-bit load or store, but if the LDM* and STM* instructions split each 32-bit operation into two 16-bit ones, they should still come out much faster than discrete loads and stores, especially if code has to be fetched from 16-bit memory.
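For completeness, wiring such a routine up from C could look like the sketch below; the routine name copy_48_bytes and the calling code are assumptions for illustration, relying on the calling convention passing the first two arguments in R0 and R1:

/* The LDMIA/STMIA sequence above, placed in an assembly file and exported
   under an assumed name. R0 = source (FPGA registers), R1 = destination (SRAM). */
extern void copy_48_bytes(uint16_t volatile const * src, uint16_t * dst);

/* Inside the ISR, using the question's p_hardware and data_from_hardware: */
copy_48_bytes(p_hardware, data_from_hardware);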