I'm trying to improve the performance of my image processing project, which runs on an ARM Cortex-A8 processor.
I'm reading 8-bit grayscale image data from memory. Right now my function accesses each pixel value individually, byte by byte.
I thought I could improve this with NEON by loading 128/8 = 16 bytes from memory in one shot and then using them in my function. But when I run the changed version, it actually takes MORE time than the byte-by-byte access. I suspect that fetching the data with NEON is becoming the bottleneck and takes longer than the computation itself.
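To illustrate the two access patterns being compared, here is a minimal sketch. It is not the actual project code: the function names `sum_pixels_scalar` and `sum_pixels_neon` are hypothetical, and a plain pixel sum stands in for the real per-pixel computation. It assumes GCC or Clang with `arm_neon.h` available, and falls back to the scalar loop when NEON is not enabled.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Byte-by-byte access: one 8-bit load per pixel.
 * (Hypothetical stand-in for the original function.) */
uint32_t sum_pixels_scalar(const uint8_t *img, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += img[i];
    return sum;
}

/* NEON access: 16 pixels per 128-bit load.
 * Falls back to the scalar loop if NEON is not available. */
uint32_t sum_pixels_neon(const uint8_t *img, size_t n)
{
#if HAVE_NEON
    uint32x4_t acc = vdupq_n_u32(0);
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        uint8x16_t px  = vld1q_u8(img + i);   /* load 16 bytes at once */
        uint16x8_t s16 = vpaddlq_u8(px);      /* pairwise add, widen to 16-bit */
        acc = vpadalq_u16(acc, s16);          /* pairwise add, accumulate in 32-bit */
    }

    /* Reduce the four 32-bit lanes to a single value. */
    uint32_t lanes[4];
    vst1q_u32(lanes, acc);
    uint32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Handle the remaining pixels that do not fill a full vector. */
    for (; i < n; ++i)
        sum += img[i];
    return sum;
#else
    return sum_pixels_scalar(img, n);
#endif
}
```

Both functions should return the same result for the same buffer, so timing them on an identical image gives a like-for-like comparison of the two load strategies.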
What is the data bus width of the ARM Cortex-A8? How many bytes are read from memory in a single fetch?