
I am having trouble implementing a NEON version of memcpy(dst, src, sz).

Because the DMA memory on this ARM SoC is uncached, copying from it is very slow.

void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyPLD:                          \n"
        "    VLDM %[src]!,{d0-d7}                 \n"
        "    VSTM %[dst]!,{d0-d7}                 \n"
        "    SUBS %[sz],%[sz],#0x40                 \n"
        "    BGT NEONCopyPLD                  \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}

This is code for ARMv7 by @Timothy Miller, from "ARM/neon memcpy optimized for *uncached* memory?"

Since the ARM64 instruction set has no VLDM and VSTM, I am using LD4 and ST4 instead. However, this is as slow as memcpy() in C.

"NEONCopyPLD: \n"
"ld4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64 \n"
"st4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64 \n"
"SUBS %[sz], %[sz],#0x40\n"
"BGT NEONCopyPLD \n"

Is there a better way than using LD and ST on ARM64?

That is an ARMv7, not an ARM7, but anyway... - old_timer
Note that you shouldn't use plain old labels in inline assembly, as the inline assembly snippet might be instantiated more than once in your program. Instead, use %= to make the label names unique. Also consider prefixing your label names with .L so they don't appear in the symbol table and don't confuse any debug tools. - fuz
There is absolutely no reason to use ld4 and st4. Replace them with ld1 and st1 respectively. Move subs one line upward. The branch should be b.gt. - Jake 'Alquimista' LEE
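
The suggestions in the comments above can be combined into one function. What follows is only a sketch, not the commenters' own code: the name copy_ld1 is made up for illustration, and handling the leftover bytes with memcpy() is my assumption (the question's code rounds sz up instead, which copies past the end of the buffers). The label uses .L and %= so each instantiation gets its own local label, ld1/st1 replace ld4/st4, and subs sits between the load and the store. On non-AArch64 targets the sketch simply falls back to memcpy() so that it still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies sz bytes from src to dst.
 * Full 64-byte blocks go through ld1/st1; the remainder goes through memcpy(). */
static void copy_ld1(uint8_t *dst, const uint8_t *src, size_t sz)
{
    size_t bulk = sz & ~(size_t)63;   /* bytes covered by whole 64-byte blocks */
    size_t tail = sz - bulk;          /* 0..63 leftover bytes */

#if defined(__aarch64__)
    if (bulk != 0) {
        __asm__ volatile (
            ".Lcopy%=:                                                \n"
            "    ld1  {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64 \n"
            "    subs %[n], %[n], #64                                 \n"
            "    st1  {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64 \n"
            "    b.gt .Lcopy%=                                        \n"
            : [dst] "+r"(dst), [src] "+r"(src), [n] "+r"(bulk)
            :
            : "v0", "v1", "v2", "v3", "cc", "memory");
    }
#else
    /* Portable fallback so the sketch builds on other targets. */
    memcpy(dst, src, bulk);
    dst += bulk;
    src += bulk;
#endif
    /* dst and src now point just past the bulk region on both paths. */
    memcpy(dst, src, tail);
}
```

Because the post-indexed ld1/st1 forms advance src and dst themselves, the pointers are already in the right place for the tail copy when the loop exits.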

1 Answer


AArch64 provides non-temporal load/store pair instructions (LDNP/STNP) for uncached areas.

Below is what I suggest:

"NEONCopyPLD: \n"
"sub %[dst], %[dst], #64 \n"
"1: \n"
"ldnp q0, q1, [%[src]] \n"
"ldnp q2, q3, [%[src], #32] \n"
"add %[dst], %[dst], #64 \n"
"subs %[sz], %[sz], #64 \n"
"add %[src], %[src], #64 \n"
"stnp q0, q1, [%[dst]] \n"
"stnp q2, q3, [%[dst], #32] \n"
"b.gt 1b \n"

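For reference, the non-temporal loop above could be wrapped in a complete function roughly as follows. This is a sketch under a few assumptions of mine, not part of the original suggestion: the name copy_nontemporal is hypothetical, the leftover bytes (sz not a multiple of 64) are copied with memcpy() rather than by rounding sz up, and the pointer increments are placed after the loads and stores instead of pre-biasing dst by 64. It also assumes the buffers are ordinary non-cacheable memory on which these accesses are permitted. On non-AArch64 targets it falls back to memcpy() so that it still compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper around an ldnp/stnp copy loop. */
static void copy_nontemporal(void *dst_, const void *src_, size_t sz)
{
    uint8_t *dst = dst_;
    const uint8_t *src = src_;
    size_t bulk = sz & ~(size_t)63;   /* bytes covered by whole 64-byte blocks */
    size_t tail = sz - bulk;          /* 0..63 leftover bytes */

#if defined(__aarch64__)
    if (bulk != 0) {
        __asm__ volatile (
            "1:                              \n"
            "    ldnp q0, q1, [%[src]]       \n"
            "    ldnp q2, q3, [%[src], #32]  \n"
            "    add  %[src], %[src], #64    \n"
            "    subs %[n], %[n], #64        \n"  /* sets the flags tested by b.gt */
            "    stnp q0, q1, [%[dst]]       \n"
            "    stnp q2, q3, [%[dst], #32]  \n"
            "    add  %[dst], %[dst], #64    \n"  /* plain add leaves the flags alone */
            "    b.gt 1b                     \n"
            : [dst] "+r"(dst), [src] "+r"(src), [n] "+r"(bulk)
            :
            : "v0", "v1", "v2", "v3", "cc", "memory");
    }
#else
    /* Portable fallback so the sketch builds on other targets. */
    memcpy(dst, src, bulk);
    dst += bulk;
    src += bulk;
#endif
    /* Copy the remaining 0..63 bytes. */
    memcpy(dst, src, tail);
}
```

The numeric label 1: with the 1b reference is safe even if the compiler instantiates the asm block more than once, so no %= suffix is needed here.
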
For cached areas:

"NEONCopyPLD: \n"
"sub %[src], %[src], #32 \n"
"sub %[dst], %[dst], #32 \n"
"1: \n"
"ldp q0, q1, [%[src], #32] \n"
"ldp q2, q3, [%[src], #64]! \n"
"subs %[sz], %[sz], #64 \n"
"stp q0, q1, [%[dst], #32] \n"
"stp q2, q3, [%[dst], #64]! \n"
"b.gt 1b \n"