memcpy for ARM uncached memory for ARM64

Question

I am having trouble implementing a NEON version of memcpy(dst, src, sz) for ARM64.

Because the DMA buffer on this ARM SoC is not cached, copying out of it is very slow.

void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyPLD:                          \n"
        "    VLDM %[src]!,{d0-d7}                 \n"
        "    VSTM %[dst]!,{d0-d7}                 \n"
        "    SUBS %[sz],%[sz],#0x40                 \n"
        "    BGT NEONCopyPLD                  \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}

This is ARMv7 code by @Timothy Miller, from the question "ARM/neon memcpy optimized for *uncached* memory?".

Since the ARM64 instruction set has no VLDM and VSTM, I am using LD and ST instead. However, the result is as slow as memcpy() in C.

"NEONCopyPLD: \n"
"ld4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64 \n"
"st4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64 \n"
"SUBS %[sz], %[sz],#0x40\n"
"BGT NEONCopyPLD \n"

Is there a better way than using LD and ST on ARM64?

Note that you shouldn't use plain old labels in inline assembly as the inline assembly snippet might be instantiated more than once in your program. Instead, use %= to make the label names unique. Also consider prefixing your label names with .L so they don't appear in the symbol table and don't confuse any debug tools. — fuz
There is absolutely no reason to use ld4 and st4; replace them with ld1 and st1. Move subs one line up. The branch should be b.gt. — Jake 'Alquimista' LEE

Jake 'Alquimista' LEE · Accepted Answer · 2020-04-16T10:43:52

AArch64 has memory operations suited to uncached areas: the non-temporal load/store pair instructions ldnp and stnp.

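Below is what I suggest:

"sub %[dst], %[dst], #64 \n"        // pre-bias dst; the loop adds 64 before each store
"1: \n"
"ldnp q0, q1, [%[src]] \n"          // load 64 bytes with the non-temporal hint
"ldnp q2, q3, [%[src], #32] \n"
"add %[dst], %[dst], #64 \n"
"subs %[sz], %[sz], #64 \n"         // sets the flags tested by b.gt below
"add %[src], %[src], #64 \n"
"stnp q0, q1, [%[dst]] \n"          // store 64 bytes with the non-temporal hint
"stnp q2, q3, [%[dst], #32] \n"
"b.gt 1b \n"                        // loop while bytes remain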

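For a cached area:

"sub %[src], %[src], #32 \n"        // pre-bias both pointers by 32 so the
"sub %[dst], %[dst], #32 \n"        // second access of each pair can write back
"1: \n"
"ldp q0, q1, [%[src], #32] \n"      // first 32 bytes of the block
"ldp q2, q3, [%[src], #64]! \n"     // next 32 bytes; write-back advances src by 64
"subs %[sz], %[sz], #64 \n"         // sets the flags tested by b.gt below
"stp q0, q1, [%[dst], #32] \n"
"stp q2, q3, [%[dst], #64]! \n"     // write-back advances dst by 64
"b.gt 1b \n"                        // loop while bytes remain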

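For completeness, here is a minimal sketch of how the non-temporal loop might be wrapped into a full function. The name copy_blocks64 is hypothetical. The label uses %= and a .L prefix as suggested in the comments on the question, and the non-AArch64 branch simply falls back to memcpy so the snippet builds on other hosts. It assumes both buffers are padded to a multiple of 64 bytes, because the size is rounded up the same way the question's code does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies sz bytes rounded UP to a multiple of 64.
 * Assumption: both buffers are padded to a multiple of 64 bytes. */
void copy_blocks64(void *dst, const void *src, size_t sz)
{
    assert(dst != NULL && src != NULL);

    size_t n = (sz + 63) & ~(size_t)63;   /* whole 64-byte blocks */
    if (n == 0)
        return;                           /* nothing to copy */

#if defined(__aarch64__)
    __asm__ volatile (
        ".Lcopy%=:                         \n" /* %= makes the label unique */
        "    ldnp q0, q1, [%[src]]         \n" /* non-temporal load, 64 bytes */
        "    ldnp q2, q3, [%[src], #32]    \n"
        "    add  %[src], %[src], #64      \n"
        "    subs %[n], %[n], #64          \n" /* sets flags for b.gt */
        "    stnp q0, q1, [%[dst]]         \n" /* non-temporal store, 64 bytes */
        "    stnp q2, q3, [%[dst], #32]    \n"
        "    add  %[dst], %[dst], #64      \n" /* plain add leaves flags alone */
        "    b.gt .Lcopy%=                 \n"
        : [dst] "+r"(dst), [src] "+r"(src), [n] "+r"(n)
        :
        : "v0", "v1", "v2", "v3", "cc", "memory");
#else
    /* Fallback so the sketch also compiles on non-AArch64 hosts. */
    memcpy(dst, src, n);
#endif
}
```

Whether ldnp/stnp actually beat ldp/stp on a particular SoC is something to measure on the target hardware.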