I am having trouble implementing memcpy(dst, src, sz) with NEON.
The DMA memory on this ARM SoC is uncached, so copying out of it is very slow.
    void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
    {
        if (sz & 63) {
            sz = (sz & -64) + 64;
        }
        asm volatile (
            "NEONCopyPLD: \n"
            " VLDM %[src]!,{d0-d7} \n"
            " VSTM %[dst]!,{d0-d7} \n"
            " SUBS %[sz],%[sz],#0x40 \n"
            " BGT NEONCopyPLD \n"
            : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz)
            :
            : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
    }
This is ARMv7 code by @Timothy Miller, from the question "ARM/neon memcpy optimized for *uncached* memory?"
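The size adjustment at the top of my_copy rounds sz up to the next multiple of 64, so the loop can read and write up to 63 bytes past the requested length, and both buffers have to be padded accordingly. A minimal sketch of the same arithmetic as a standalone helper (round_up_64 is a hypothetical name, not part of the original code):

```c
#include <stddef.h>

/* Hypothetical helper: round sz up to the next multiple of 64.
   Gives the same result as the original
       if (sz & 63) sz = (sz & -64) + 64;
   for non-negative sizes that do not overflow. */
static size_t round_up_64(size_t sz)
{
    return (sz + 63) & ~(size_t)63;
}
```

For example, a request for 100 bytes becomes a 128-byte copy.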
Since the ARM64 instruction set has no VLDM and VSTM,
I am using LD and ST instead. However, the result is as slow as memcpy() in C.
    "NEONCopyPLD: \n"
    "ld4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64 \n"
    "st4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64 \n"
    "SUBS %[sz], %[sz],#0x40\n"
    "BGT NEONCopyPLD \n"
Is there a better way to do this on ARM64 than using LD and ST?
Use %= to make the label names unique. Also consider prefixing your label names with .L so they don't appear in the symbol table and don't confuse any debug tools. – fuz
ld4 and st4: replace them with ld1 and st1 each. Move subs one line upward. It should be b.gt. – Jake 'Alquimista' LEE
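Putting the suggestions from the comments together, the loop might look roughly like the sketch below. It is only an illustration under a few assumptions: my_copy64 is a hypothetical name, sz is widened to size_t so that the operand fills the whole 64-bit register, and the non-AArch64 branch simply falls back to memcpy so the snippet builds elsewhere. Whether it is actually faster on uncached memory would still need to be measured on the target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies sz bytes rounded up to a multiple of 64,
   so both buffers must be padded to that size (same contract as my_copy). */
static void my_copy64(void *dst, const void *src, size_t sz)
{
    if (sz == 0)
        return;                       /* the loop below always runs at least once */
    sz = (sz + 63) & ~(size_t)63;     /* round up to a multiple of 64 */
#if defined(__aarch64__)
    __asm__ volatile (
        /* .L keeps the label out of the symbol table; %= makes it unique
           for each instance of this asm statement. */
        ".Lcopy%=:                                                  \n"
        "    ld1  {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64   \n"
        "    subs %[sz], %[sz], #64                                 \n"
        "    st1  {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64   \n"
        "    b.gt .Lcopy%=                                          \n"
        : [dst] "+r"(dst), [src] "+r"(src), [sz] "+r"(sz)
        :
        : "v0", "v1", "v2", "v3", "cc", "memory");
#else
    /* Portable fallback (assumption, for building on other targets). */
    memcpy(dst, src, sz);
#endif
}
```

ld1/st1 move the bytes in order, whereas ld4/st4 de-interleave and re-interleave them, which is extra work a plain copy does not need.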