ARM LDM and STM wrt data cache and databus

Question

I'm looking for an efficient way to copy 42 consecutive 32-bit memory locations.
Note: A snapshot array is copied to a log array.

I'm using the LDMIA and STMIA pair (10 registers per instruction):

LDMIA  R0!, {R2-R11}    ; Read 10 array slots
STMIA  R1!, {R2-R11}    ; Write 10 array slots

My questions:

How do these instructions affect the data cache?
Is the data bus locked during the entire load/store or is it only locked per 32-bit load / store?
In other words, for the LDM instruction, does the ARM lock the data bus and load all the data into registers, or is the data bus only locked for each 32-bit transfer?

The code is running on an ARM Cortex-A8 (Texas Instruments AM3358).

I didn't find any hardware details on this in the ARM Architecture Reference Manual.

Look at the AMBA/AXI documentation. I can't imagine there is any bus locking of any kind: the load address request goes out with an ID, later comes the "I accept that address" response, and later still the "here is your data" response. The buses are designed so that many transfers can be in flight at once. Being aligned like this and a multiple of 64 bits, it is possible that each instruction is one transaction. I have reason to believe, though, that the store may be turned into separate 64-bit transfers even though the length field allows for many more. — old_timer
And who knows, maybe the load is broken into 8 words and then 2. I don't know what the cache line size is. You want to make the largest transactions you can (as many LDM/STM registers as possible), but if you can also align on cache lines and copy whole lines, I would assume that helps too. You can use the systick timer, ideally with nothing else going on (bare metal), and run some speed experiments to see whether 8 registers, aligned, move better than 10; do read-only tests, write-only tests, etc. — old_timer
A combination of the TRM for the core, the TRM for the L2 cache, and the bus documentation (AMBA/AXI) is probably all the published documentation you are going to find. — old_timer
I don't think the ordinary load and store instructions, including LDMIA/STMIA, ever lock the external memory bus (AXI). You need to use an exclusive load or store instruction to get an exclusive lock of the AXI bus, and exclusive locks are only meaningful when an exclusive load is immediately followed by an exclusive store to the same memory location. There is also a legacy lock mode used by the SWP instruction. Both types of AXI locks are only used if the memory region is uncached; otherwise they're handled by the L2 cache. — Ross Ridge
You might want to describe exactly why you need the data bus locked, since the Cortex-A8 is fundamentally a single-core CPU with no support for working with other CPUs (e.g. no cache coherency). If it's some other device on the bus that needs to see the snapshot array read as atomic or the log array as atomic (or both as one atomic operation), you're going to have to find some other way to keep the device off the bus. — Ross Ridge
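
The chunking and cache-line arithmetic discussed in the comments can be sketched in C roughly as follows. This is only a sketch under stated assumptions: `copy_in_chunks` and `lines_touched` are made-up names, the 64-byte line size used in the note below is an assumption to confirm against the Cortex-A8 TRM, and a C compiler is free to emit something other than LDM/STM for the inner loop, so on the real target you would substitute the assembly pair and time it with a hardware timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n_words 32-bit words in blocks of `chunk` words,
 * mirroring a sequence of LDMIA/STMIA pairs that each move `chunk` registers.
 * Leftover words (n_words % chunk) are copied in a final, shorter block. */
static void copy_in_chunks(uint32_t *dst, const uint32_t *src,
                           size_t n_words, size_t chunk)
{
    assert(chunk > 0);
    while (n_words > 0) {
        size_t n = (n_words < chunk) ? n_words : chunk;
        for (size_t i = 0; i < n; i++) {
            dst[i] = src[i];
        }
        dst += n;
        src += n;
        n_words -= n;
    }
}

/* Hypothetical helper: number of cache lines overlapped by a buffer of
 * `bytes` bytes starting at address `addr`, for a given line size in bytes. */
static size_t lines_touched(uintptr_t addr, size_t bytes, size_t line_size)
{
    assert(line_size > 0);
    if (bytes == 0) {
        return 0;
    }
    uintptr_t first = addr / line_size;
    uintptr_t last = (addr + bytes - 1) / line_size;
    return (size_t)(last - first + 1);
}
```

With 10-word chunks, the 42-word array is copied as four full blocks plus a 2-word tail. Assuming 64-byte lines, the 168-byte array overlaps 3 lines when it starts on a line boundary (`lines_touched(0, 168, 64)`) and 4 when it starts 60 bytes into a line (`lines_touched(60, 168, 64)`), which is one reason aligning the arrays may be worth testing.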

Igor Skochinsky · Accepted Answer · 2017-03-17T22:06:54

You should check out the Cortex-A Series Programmer's Guide from ARM. I don't have it here right now to quote, but AFAIR it spends quite a lot of time on efficient memory handling, if not specifically on low-level details like bus locking (for those you probably need to look at the AHB/AXI documentation, but I don't believe that's really necessary here).
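
As a point of comparison for any hand-written LDM/STM loop, a plain `memcpy` is a reasonable baseline, since C libraries for ARM usually ship a tuned implementation (an assumption worth checking for your toolchain). A minimal sketch, where `SNAPSHOT_WORDS` and `log_snapshot` are illustrative names rather than anything from the question:

```c
#include <stdint.h>
#include <string.h>

#define SNAPSHOT_WORDS 42u  /* array length taken from the question */

/* Hypothetical helper: copy one snapshot into one log entry.
 * The two buffers must not overlap (a memcpy requirement). */
static void log_snapshot(uint32_t log_entry[SNAPSHOT_WORDS],
                         const uint32_t snapshot[SNAPSHOT_WORDS])
{
    memcpy(log_entry, snapshot, SNAPSHOT_WORDS * sizeof(uint32_t));
}
```

Like the LDM/STM version, this makes no promise of atomicity with respect to other bus masters; it only gives you a number to measure the assembly against.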
