We have an app that requires ~1MB buffers for a hardware device to fill, therefore we wrote a kernel module that allocates buffers using kmalloc(). We did not use dma_alloc_coherent() as we need to manipulative the buffers and therefore wanted them to be cached (we flush the cache when needed). One of the manipulations that is done is the kernel module copies one buffer to another buffer. In timing these copies we see it takes about ~2ms to copy a buffer. The time does not include any cache flushing.
As this seemed slow we wrote a standard userspace test app, that used malloc() to create 1MB buffers and copied them. The userspace copies took about .5ms, which is about the correct time to move this amount of memory on the processor/memory config we are using.
Thinks we tried: To make sure it wasn't a different memcpy() in kernel space and user space we wrote our own NEON optimized copy, but made no difference. Changed the buffer size from 100KB to 10MB and made no difference. All times were over 10 copies, but always very very consistent. Time routine used gettimeofday() in userspace.
Only thing we can think of is that the data cache is setup up different for kmalloc()'ed memory then malloc()'ed memory???
We are working on iMX6 ARM, Linaro kerne.