I have converted part of an algorithm from C to ARM assembly (using NEON instructions), but it is now 2x slower than the original C code. How can I improve performance?
The target is an ARM Cortex-A9.
The algorithm reads 64-bit values from an array. One byte is extracted from each value and used as the index into a lookup table. This is done about 10 times; the resulting table values are XORed together and the final result is written to another array.
Something like this:
result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
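Spelled out in plain C, the scalar version looks roughly like the sketch below. This is only an illustration under assumptions: NUM_TABLES, get_byte, combine and the idx array are made-up names, the tables are stored back to back with 256 entries each, and I assume table k is always indexed with byte k. The real code uses about 10 tables.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_TABLES 4 /* assumption for the sketch; the real code has about 10 */

/* Extract byte number n (0 = least significant) from a 64-bit value. */
static inline uint8_t get_byte(uint64_t v, unsigned n)
{
    return (uint8_t)(v >> (8u * n));
}

/*
 * XOR together one 64-bit entry from each table.
 * tables: NUM_TABLES tables of 256 entries each, stored back to back.
 * idx[k]: which element of a[] feeds table k (hypothetical index scheme).
 */
static uint64_t combine(const uint64_t *tables, const uint64_t *a,
                        const size_t idx[NUM_TABLES])
{
    uint64_t r = 0;
    for (unsigned k = 0; k < NUM_TABLES; ++k)
        r ^= tables[k * 256u + get_byte(a[idx[k]], k)];
    return r;
}
```

Each result[i] would then be one call to combine with the appropriate indices.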
In my approach I load the whole array "a" into NEON registers, then move the relevant byte into an ARM register, calculate the offset, and load the value from the table:
vldm.64 r0, {d0-d7} // Load 8 x 64-bit values from the input array
vmov.u8 r12, d0[0] // Move the first byte of d0 into r12
add r12, r2, r12, asl #3 // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12] // d8 = mem[r12]
.
.
.
veor d8, d8, d9 // d8 = d8 ^ d9
veor d8, d8, d10 // d8 = d8 ^ d10, etc.
Here r2 holds the base address of the lookup table, so the address is computed as:
address = Table_address + (8 * value_fromByte);
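In C terms, this address calculation is meant to be the same as plain indexing into a table of 64-bit entries, since each entry is 8 bytes wide. A small sketch to make that explicit (entry_address is a made-up helper name, not something from my code):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Byte-wise address of a table entry, mirroring the add instruction:
 * base address plus the byte value shifted left by 3 (multiplied by 8).
 */
static const uint64_t *entry_address(const uint64_t *table_base, uint8_t value)
{
    const unsigned char *base = (const unsigned char *)table_base;
    return (const uint64_t *)(base + ((size_t)value << 3));
}
```

So entry_address(T, b) should point at the same element as &T[b].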
This step (except for the initial load) is repeated about 100 times. Why is this so slow?
Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest? How can I perform the offset calculation entirely within NEON registers? Thank you.