Fast ARM NEON memcpy

Question

I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line.

for(i = 0; i < h; i++) {
  memcpy(d, s, w);
  s += sp;
  d += dp;
}

I know that the following

d, dp, s, sp, w

are all 32-byte aligned, so my next (still quite naive) implementation was along the lines of

for (int i = 0; i < h; i++) {
  uint8_t* dst = d;
  const uint8_t* src = s;
  int remaining = w;
  asm volatile (
    "1:                                               \n"
    "subs     %[rem], %[rem], #32                     \n"
    "vld1.u8  {d0, d1, d2, d3}, [%[src],:256]!        \n"
    "vst1.u8  {d0, d1, d2, d3}, [%[dst],:256]!        \n"
    "bgt      1b                                      \n"
    : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
    :
    : "d0", "d1", "d2", "d3", "cc", "memory"
  );
  d += dp;
  s += sp;
}

This was ~150% faster than memcpy over a large number of iterations (on different images, so it was not benefiting from caching). I feel this should be nowhere near the optimum because I have yet to use preloading, but whenever I add it, performance gets substantially worse. Does anyone have any insight here?

Try unrolling the loop by at least 2X. NEON loads are not instantaneous due to pipelining and memory speed. If you do 2 loads followed by 2 stores, you should see a benefit. The cache preload can definitely speed things up, but the read-ahead distance needs to be tuned to your target platform. — BitBank
I tried that, but the difference was negligible. I followed the same reasoning, but bear in mind that those loads and stores are only 2 cycles each (source). The cache line size is 64 bytes; I tried prefetching 64, 128, 192 and 256 bytes ahead, and every distance made the copy considerably slower (2-3 times). — robbie_c
Have you tried looking at memcpy source? Maybe it is already optimized and uses NEON instructions on your platform. — Mārtiņš Možeiko
Prefetching is notoriously difficult to get right and rarely helpful. For memcpy you have no computation cycles to speak of so there probably isn't anything to be gained from prefetching. — Paul R
Have you thought about using the DMA? I don't know how much faster/slower the copy would be, but you could be doing other processing, so your overall app speed may improve? — Josh Petitt
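
For reference, the 2X unrolling suggested in the comments could look roughly like the sketch below. It uses NEON intrinsics instead of inline assembly, which is a substitution and not what the question's code does. The name copy_row_unrolled is hypothetical, and the sketch assumes a compiler that provides arm_neon.h when NEON is available (it falls back to plain memcpy otherwise). The compiler is free to reorder the loads and stores, so whether this helps at all has to be measured on the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the thread): copies n bytes from src to dst.
 * Handles 64 bytes per iteration, i.e. two of the question's 32-byte blocks. */
static void copy_row_unrolled(uint8_t *dst, const uint8_t *src, size_t n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (n >= 64) {
        /* Issue all loads first, then all stores. */
        uint8x16_t a0 = vld1q_u8(src);
        uint8x16_t a1 = vld1q_u8(src + 16);
        uint8x16_t b0 = vld1q_u8(src + 32);
        uint8x16_t b1 = vld1q_u8(src + 48);
        vst1q_u8(dst,      a0);
        vst1q_u8(dst + 16, a1);
        vst1q_u8(dst + 32, b0);
        vst1q_u8(dst + 48, b1);
        src += 64;
        dst += 64;
        n   -= 64;
    }
#endif
    /* Tail bytes, or the whole row when NEON is not available. */
    if (n != 0) {
        memcpy(dst, src, n);
    }
}
```

In the per-line loop from the question, this would replace the asm block: copy_row_unrolled(d, s, w) followed by the same stride increments.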

Peter M · Accepted Answer · 2013-02-12T16:46:55

ARM has a great tech note on this.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

Your performance will definitely vary depending on the micro-architecture. ARM's note is on the A8, but I think it will give you a decent idea. The summary at the bottom is a great discussion of the pros and cons that go beyond the raw numbers, such as which methods use the fewest registers.

And yes, as another commenter mentions, pre-fetching is very difficult to get right, and will work differently with different micro-architectures, depending on how big the caches are and how big each line is and a bunch of other details about the cache design. You can end up thrashing lines you need if you aren't careful. I would recommend avoiding it for portable code.
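
If you do want to experiment with prefetching anyway, one way to keep it tunable is to make the read-ahead distance a parameter, so it can be measured per target and switched off entirely. The following is only a minimal sketch under stated assumptions: copy_image_prefetch, CHUNK and prefetch_dist are hypothetical names, it uses the GCC/Clang __builtin_prefetch builtin rather than a hand-written pld, and the 64-byte chunk size simply mirrors the cache line size mentioned in the comments.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 64 /* assumed cache line size; adjust for the target core */

/* Hypothetical helper: copies an image of h rows, w bytes per row, with
 * source stride sp and destination stride dp (both in bytes).
 * prefetch_dist is the read-ahead distance in bytes; 0 disables prefetch. */
static void copy_image_prefetch(uint8_t *d, const uint8_t *s,
                                size_t w, size_t h,
                                size_t dp, size_t sp,
                                size_t prefetch_dist)
{
    (void)prefetch_dist; /* unused when the builtin is unavailable */
    for (size_t y = 0; y < h; y++) {
        uint8_t *drow = d + y * dp;
        const uint8_t *srow = s + y * sp;
        for (size_t x = 0; x < w; x += CHUNK) {
            size_t n = (w - x < CHUNK) ? (w - x) : CHUNK;
#if defined(__GNUC__)
            /* Only prefetch addresses that stay inside the current row. */
            if (prefetch_dist != 0 && x + prefetch_dist < w) {
                __builtin_prefetch(srow + x + prefetch_dist, 0, 0);
            }
#endif
            memcpy(drow + x, srow + x, n);
        }
    }
}
```

Calling it with prefetch_dist set to 0 gives a baseline for comparison; any non-zero distance should only be kept if it wins in measurements on the specific core.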
