4
votes

I'm trying to improve the performance of my image processing project running on an ARM Cortex-A8 processor.

I am reading 8-bit grayscale image data from memory. Right now my function accesses each pixel value individually, byte by byte.

I thought that by using NEON I could improve this by loading 128/8 = 16 bytes from memory in one shot and then using them in my function. But when I run the changed version, it actually takes MORE time than byte-by-byte access. I suspect that the NEON fetch has become the bottleneck and takes more time than my computation.

What is the data bus size of ARM Cortex-A8? How many bytes are accessed from memory in one memory fetch?

2
The cache will typically have abstracted this away. From SDRAM, it will do burst reads and writes. If you are using direct screen memory, then the cache may be write-through. The answer will depend on what memory you are using. You should always benchmark memory performance and then compare to your code. See: Cortex-A8 memory copy. - artless noise

2 Answers

3
votes

From the Cortex A8 TRM:

"You can configure the processor to connect to either a 64-bit or 128-bit AXI interconnect that provides flexibility to system designs"

Is NEON necessary? Perhaps you are comparing apples to oranges. Instead of ldrb/strb you can use ldrd/strd or ldm/stm to get 64-bit transfers. The ARM/AXI can be smart enough to look ahead and group smaller transfers into larger ones, say two 32-bit transfers into one 64-bit transfer, but I would not rely on that. I only mention it in case you find that changing to ldr/str or ldrd/strd doesn't give you any performance gain.

Did you isolate the read or write loop (no data processing) and try bytes vs. words vs. double words? It may be that the code to extract bytes from words overwhelms the savings on the bus.
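To make that comparison concrete, here is a minimal sketch in portable C of an isolated read loop in two flavours: one byte load per pixel, and one 64-bit load per 8 pixels followed by shift-and-mask extraction. The names `sum_bytes`, `sum_words64`, and `time_reads` are made up for illustration, and the summing is only there so the compiler cannot drop the loads. What instructions you actually get depends on your compiler and flags, so check the disassembly before trusting the numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Baseline: one byte access per pixel (typically ldrb on ARM,
 * unless the compiler vectorizes the loop). */
uint32_t sum_bytes(const uint8_t *px, size_t n)
{
    assert(px != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += px[i];
    return sum;
}

/* One 64-bit load per 8 pixels, then extract each byte by shift and mask.
 * The extraction work is exactly the overhead that may eat the bus savings. */
uint32_t sum_words64(const uint8_t *px, size_t n)
{
    assert(px != NULL);
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, px + i, sizeof w);   /* alignment-safe way to read 8 bytes */
        for (int b = 0; b < 8; b++) {
            sum += (uint32_t)(w & 0xFF);
            w >>= 8;
        }
    }
    for (; i < n; i++)                  /* leftover pixels when n % 8 != 0 */
        sum += px[i];
    return sum;
}

/* Hypothetical timing helper: runs fn over the buffer reps times and
 * returns elapsed CPU seconds; the accumulated result goes to *out. */
double time_reads(uint32_t (*fn)(const uint8_t *, size_t),
                  const uint8_t *px, size_t n, int reps, uint32_t *out)
{
    uint32_t acc = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        acc += fn(px, n);
    *out = acc;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Call `time_reads` once with each function on the same image buffer, using enough repetitions to get a measurable time, and compare. If the 64-bit version is not faster, the bus width is probably not your bottleneck.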

What type of memory is this? Is it on-chip or off-chip? How fast is this memory relative to the AXI (ARM) clock?

Do you have the data cache enabled for this region? If so, it may be a moot point: the first byte read will do a cache line fill using an optimal data bus size, and subsequent reads within that cache line will not reach the AXI bus, much less the target memory. Likewise, the writes should only go as far as the cache and go out to the target later in a wider, bus-optimized size. It depends on how the cache/write buffer is configured.
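As a rough illustration of why the load width may not matter with the cache on, the sketch below counts cache line fills for a sequential scan under a very simplified model. The assumptions are mine: the buffer starts on a line boundary, nothing is evicted during the scan, and the line size is passed in (64 bytes is what I would expect for the Cortex-A8 L1, but check the TRM for your part). `count_line_fills` is a made-up name, not an ARM API.

```c
#include <stddef.h>

/* Simplified model: count how many cache lines must be filled when scanning
 * n_bytes sequentially in chunks of access_width bytes.
 * Assumes a line-aligned buffer and no evictions during the scan. */
size_t count_line_fills(size_t n_bytes, size_t access_width, size_t line_size)
{
    if (access_width == 0 || line_size == 0)
        return 0;                        /* guard against a non-terminating loop */

    size_t fills = 0;
    size_t next_unfetched_line = 0;      /* lines below this index are cached */

    for (size_t off = 0; off < n_bytes; off += access_width) {
        size_t last = off + access_width - 1;
        if (last >= n_bytes)
            last = n_bytes - 1;          /* clamp the final partial access */
        size_t last_line = last / line_size;
        while (next_unfetched_line <= last_line) {
            fills++;                     /* first touch of this line: one fill */
            next_unfetched_line++;
        }
    }
    return fills;
}
```

For a 640x480 8-bit image and 64-byte lines, this model gives the same number of fills whether you read 1 byte or 16 bytes at a time, which is the point: with caching enabled, traffic on the bus is set by the line size, not by the width of your load instruction.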

0
votes

It could be that you are experiencing pipeline stalls. If you read the data through NEON, there will be some latency before you can use that data in the ARM core.