3
votes

I have converted part of an algorithm from C to ARM assembler (using NEON instructions), but now it is 2x slower than the original C code. How can I improve performance?

Target is an ARM Cortex-A9.

The algorithm reads 64-bit values from an array. From each value one byte is extracted and used as the lookup index into another table. This is done about 10 times, each resulting table value is XORed with the others, and the final result is written into another array.

Something like this:

result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
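The line above can be sketched in plain C. The table names, sizes, and which byte goes with which table are assumptions here (the question only says "about 10" tables and one byte per lookup):

```c
#include <stdint.h>

/* Hypothetical layout: ten tables of 256 entries of 64-bit values each.
   Byte (t % 8) of input word a[t] indexes table T[t]; all results are
   XOR-folded into one 64-bit result. */
uint64_t lookup_xor(uint64_t T[10][256], const uint64_t a[10]) {
    uint64_t result = 0;
    for (int t = 0; t < 10; t++) {
        uint8_t byte = (uint8_t)(a[t] >> (8 * (t % 8)));  /* extract one byte */
        result ^= T[t][byte];                             /* table lookup + XOR */
    }
    return result;
}
```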

In my approach I load the whole array "a" into NEON registers, then move the right byte into an ARM register, calculate the offset, and load the value from the table:

vldm.64 r0, {d0-d7}         // Load 8 x 64-bit values from the input array

vmov.u8 r12, d0[0]          // Move the first byte of d0 into r12
add r12, r2, r12, asl #3    // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12]           // d8 = mem[r12]
.
.
.
veor d8, d8, d9             // d8 = d8 ^ d9
veor d8, d8, d10            // d8 = d8 ^ d10    ...etc.

Where r2 holds the base address of the lookup table:

address = table_address + (8 * value_from_byte);

This step (except the loading at the beginning) is done about 100 times. Why is this so slow?

Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest? How can I perform the offset calculation only within NEON registers? Thank you.

4
I don't think your C code matches the description. The C is XORing multiple bytes of the same word, but the question says each byte is used to index the next. We can't optimize code if you can't show it clearly. – phkahler
Yes, you are right. I edited it. It is always another word. – HectorLector
NEON may not be the way to optimize this. The code is SIMD in the normal ARM instruction set. You can use a 64K table (easily generated) and process 16 bits at a time, and you can also run 32 bits of EORs at once and fold the result. The algorithm of getting a random index makes this memory-bound, so the code doing the EOR won't significantly affect things. – artless noise
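The 64K-table idea from the comment above can be sketched as follows: two adjacent 256-entry tables are merged into one 65536-entry table indexed by 16 bits at once, halving the number of lookups. Names are hypothetical; note the memory cost is 512 KB per merged table for 64-bit entries:

```c
#include <stdint.h>
#include <stdlib.h>

/* Merge two 256-entry tables into one 65536-entry table so that one
   16-bit index replaces two 8-bit lookups. */
uint64_t *merge_tables(const uint64_t T0[256], const uint64_t T1[256]) {
    uint64_t *T01 = malloc(65536 * sizeof *T01);
    if (!T01) return NULL;
    for (unsigned v = 0; v < 65536; v++)
        T01[v] = T0[v & 0xFF] ^ T1[v >> 8];  /* low byte -> T0, high byte -> T1 */
    return T01;
}
```

Whether this wins depends on the cache: a 512 KB table no longer fits in the A9's L1, so the halved lookup count trades against more misses.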

4 Answers

3
votes

NEON isn't very capable of dealing with lookups larger than the VTBL instruction's limits (32 bytes, if I remember correctly).
How is the lookup table created in the first place? If it's just calculations, let NEON do the math instead of resorting to lookups. It will be much faster that way.

2
votes

Don't use:

vmov.u8 r12, d0[0]

Moving data from a NEON register to an ARM register is the worst thing you can do.

Maybe you should look at the VTBL instruction! What is your byte range, 0..255?
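For reference, VTBL with a single table register selects bytes from an 8-byte in-register table, and an out-of-range index yields 0. A scalar C model of that semantics (an illustration only, not a drop-in replacement for the 256-entry tables in the question):

```c
#include <stdint.h>

/* Scalar model of NEON VTBL with one 8-byte table register: each index
   byte selects one table byte; indices >= 8 produce 0. */
void vtbl1_model(const uint8_t table[8], const uint8_t idx[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (idx[i] < 8) ? table[idx[i]] : 0;
}
```

This is why VTBL only helps if the lookup tables can be shrunk (or split) to fit in a few NEON registers.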

1
votes

Maybe you can try:

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d0}, [r3]

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d1}, [r3]
veor     d0, d0, d1         // d0 = d0 ^ d1

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d1}, [r3]
veor     d0, d0, d1         // d0 = d0 ^ d1

...

This will not be the best solution. After that, you can increase performance by reordering instructions so that the loads and the EORs overlap.

0
votes

Try it with NEON "intrinsics". Basically, they're C functions that compile down to NEON instructions. The compiler still gets to do all the instruction scheduling, and you get the other boring stuff (moving data about) for free.

It doesn't always work perfectly, but it might be better than trying to hand code it.

Look for arm_neon.h.
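A sketch of how the XOR-folding part could look with intrinsics. The table lookups still go through normal loads; `vld1q_u64` and `veorq_u64` handle the folding two values at a time on NEON-capable builds, and the `#else` branch is a plain scalar fallback so the sketch stays portable:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* XOR-fold an array of 64-bit table values into one result. */
uint64_t xor_fold(const uint64_t *vals, int n) {
#if defined(__ARM_NEON)
    uint64x2_t acc = vdupq_n_u64(0);
    int i = 0;
    for (; i + 2 <= n; i += 2)
        acc = veorq_u64(acc, vld1q_u64(&vals[i]));   /* two 64-bit XORs at once */
    uint64_t r = vgetq_lane_u64(acc, 0) ^ vgetq_lane_u64(acc, 1);
    for (; i < n; i++)
        r ^= vals[i];                                /* leftover element, if any */
    return r;
#else
    uint64_t r = 0;
    for (int i = 0; i < n; i++)
        r ^= vals[i];
    return r;
#endif
}
```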