I'm working on an embedded system using ARM7TDMI processor.
In a time-critical ISR, I need to take a snapshot (copy) of 24 16-bit values from hardware registers into SRAM. The values are at consecutive addresses and can be treated as an array.
The data bus (to both the SRAM and the hardware registers) is 16 bits wide, and we are running in ARM mode (8/32).
At the shop we are discussing the optimal method for copying the data: as 16-bit quantities or as 32-bit quantities.
My argument is that since the ARM is in 32-bit mode, a single 32-bit load makes two 16-bit fetches with one instruction, which should be faster than two 16-bit load instructions making one fetch each.
There are also half as many instructions to fetch, which should roughly halve the copy time.
Does anybody have data to support either method? (My oscilloscopes are all allocated, so I can't take measurements on the embedded system, and I can't run the code a huge number of times because an ISR fires every millisecond.) *(Profiling is also difficult because our JTAG Jet probes don't provide a means for accurate profiling.)*
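Here is the rough bus-cycle arithmetic behind my argument, written out as a sketch. It assumes zero-wait-state memory, that instructions are also fetched over the 16-bit bus, and that the compiler emits LDRH/STRH pairs for the 16-bit copy and LDR/STR pairs for the 32-bit copy; internal cycles and S/N-cycle differences are ignored. These are assumptions, not measurements:

/* Back-of-the-envelope bus-cycle count per PAIR of 16-bit values.
 * A 32-bit ARM opcode costs two bus cycles to fetch on a 16-bit bus,
 * a 32-bit data access costs two bus cycles, a 16-bit access costs one. */
enum {
    FETCH_CYCLES_PER_ARM_INSN = 2,  /* 32-bit opcode over a 16-bit bus */
    DATA_CYCLES_PER_HALFWORD  = 1,  /* one 16-bit bus transfer         */

    /* 16-bit copy: LDRH + STRH per value = 4 instructions per pair,
     * moving 4 halfwords of data. */
    BUS_CYCLES_PER_PAIR_16BIT = (4 * FETCH_CYCLES_PER_ARM_INSN)
                              + (4 * DATA_CYCLES_PER_HALFWORD),  /* 12 */

    /* 32-bit copy: LDR + STR per pair = 2 instructions per pair,
     * still moving 4 halfwords of data over the 16-bit bus. */
    BUS_CYCLES_PER_PAIR_32BIT = (2 * FETCH_CYCLES_PER_ARM_INSN)
                              + (4 * DATA_CYCLES_PER_HALFWORD)   /*  8 */
};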
Sample code, 16-bit copy:
#include <stdint.h>

#define MAX_16_BIT_VALUES 24U

uint16_t volatile * p_hardware;   /* base of the hardware register block (set up elsewhere) */
uint16_t data_from_hardware[MAX_16_BIT_VALUES];
data_from_hardware[0] = p_hardware[0];
data_from_hardware[1] = p_hardware[1];
data_from_hardware[2] = p_hardware[2];
data_from_hardware[3] = p_hardware[3];
//...
data_from_hardware[20] = p_hardware[20];
data_from_hardware[21] = p_hardware[21];
data_from_hardware[22] = p_hardware[22];
data_from_hardware[23] = p_hardware[23];
Sample code, 32-bit copy:
/* Assumes both buffers are 4-byte aligned and the element count (24) is even. */
uint32_t * p_data_from_hardware = (uint32_t *) &data_from_hardware[0];
uint32_t volatile * p_hardware_32_ptr = (uint32_t volatile *) p_hardware;
p_data_from_hardware[0] = p_hardware_32_ptr[0];
p_data_from_hardware[1] = p_hardware_32_ptr[1];
p_data_from_hardware[2] = p_hardware_32_ptr[2];
p_data_from_hardware[3] = p_hardware_32_ptr[3];
//...
p_data_from_hardware[ 8] = p_hardware_32_ptr[ 8];
p_data_from_hardware[ 9] = p_hardware_32_ptr[ 9];
p_data_from_hardware[10] = p_hardware_32_ptr[10];
p_data_from_hardware[11] = p_hardware_32_ptr[11];
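If we do go the 32-bit route, a variation I'm considering (just a sketch, not the code above; the names data_snapshot and snapshot_32bit are only for illustration) is declaring the destination as a union, so the buffer is guaranteed 4-byte aligned and the cast from uint16_t * to uint32_t * on the SRAM side goes away:

#include <stdint.h>

#define MAX_16_BIT_VALUES 24U

/* Sketch only: the union forces 4-byte alignment of the buffer and lets
 * the ISR write 32-bit words while the rest of the code reads 16-bit
 * values.  Assumes MAX_16_BIT_VALUES is even (24 here) and that
 * p_hardware is 4-byte aligned. */
static union
{
    uint16_t u16[MAX_16_BIT_VALUES];
    uint32_t u32[MAX_16_BIT_VALUES / 2U];
} data_snapshot;

extern uint16_t volatile * p_hardware;   /* base of the register block */

void snapshot_32bit(void)
{
    uint32_t volatile * p_src = (uint32_t volatile *) p_hardware;

    data_snapshot.u32[ 0] = p_src[ 0];
    data_snapshot.u32[ 1] = p_src[ 1];
    /* ... unrolled through index 11 ... */
    data_snapshot.u32[11] = p_src[11];
}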
Details: ARM7TDMI processor running in 8/32-bit mode, IAR Embedded Workbench compiler.
Note: Code is unrolled to prevent instruction cache reloading.
Note: The assembly language listing shows that accessing memory using constant indices is more efficient than accessing it through an incremented pointer.
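To illustrate what I mean by the two forms (the function names here are only for illustration):

#include <stdint.h>

#define MAX_16_BIT_VALUES 24U

extern uint16_t volatile * p_hardware;
extern uint16_t data_from_hardware[MAX_16_BIT_VALUES];

/* Constant indices: the compiler can emit LDRH/STRH with immediate
 * offsets from base registers that stay fixed. */
void copy_indexed(void)
{
    data_from_hardware[0] = p_hardware[0];
    data_from_hardware[1] = p_hardware[1];
    /* ... */
}

/* Incremented pointers: extra pointer-update arithmetic shows up in
 * the listing with our compiler. */
void copy_incremented(void)
{
    uint16_t volatile * p_src = p_hardware;
    uint16_t          * p_dst = data_from_hardware;

    *p_dst++ = *p_src++;
    *p_dst++ = *p_src++;
    /* ... */
}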
Edit 1: Testing
As per Chris Stratton's comment, we are experiencing issues when making 32-bit fetches from our 16-bit FPGA registers, so the 32-bit optimization is not possible.
That said, I profiled a version that uses the DMA controller. The performance gain was 30 µs (microseconds). On our project we are hoping for a more substantial time savings, so this optimization is not worthwhile for us. The experiment did show that the DMA controller would be very useful if we had more data to transfer, or if the transfer could run in parallel with other work.
An interesting note is that the DMA required 17 instructions to set up.