3
votes

I have converted part of an algorithm from C to ARM assembler (using NEON instructions), but now it is 2x slower than the original C code. How can I improve performance?

Target is an ARM Cortex-A9.

The algorithm reads 64-bit values from an array. From each value one byte is extracted and used as the lookup index into another table. This is done about 10 times, each resulting table value is XORed with the others, and the final result is written into another array.

Something like this:

result[i] = T0[ GetByte0( a[i1] ) ] ^ T1[ GetByte1( a[i2] ) ] ^ ... ^ T10[ (...) ];
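The line above can be sketched in plain C. The table names, sizes, and which byte goes with which table are assumptions here (the question only says "about 10" tables and one byte per lookup):

```c
#include <stdint.h>

/* Hypothetical layout: ten tables of 256 entries of 64-bit values each.
   Byte (t % 8) of input word a[t] indexes table T[t]; all results are
   XOR-folded into one 64-bit result. */
uint64_t lookup_xor(uint64_t T[10][256], const uint64_t a[10]) {
    uint64_t result = 0;
    for (int t = 0; t < 10; t++) {
        uint8_t byte = (uint8_t)(a[t] >> (8 * (t % 8)));  /* extract one byte */
        result ^= T[t][byte];                             /* table lookup + XOR */
    }
    return result;
}
```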

In my approach I load the whole array "a" into NEON registers, then move the right byte into an ARM register, calculate the offset, and load the value from the table:

vldm.64 r0, {d0-d7}         // Load 8 x 64-bit values from the input array

vmov.u8 r12, d0[0]          // Move the first byte of d0 into r12
add r12, r2, r12, asl #3    // r12 = base_address + (r12 << 3)
vldr.64 d8, [r12]           // d8 = mem[r12]
.
.
.
veor d8, d8, d9             // d8 = d8 ^ d9
veor d8, d8, d10            // d8 = d8 ^ d10    ...etc.

Where r2 holds the base address of the lookup table:

address = table_address + (8 * value_from_byte);

This step (except the loading at the beginning) is done about 100 times. Why is this so slow?

Also, what are the differences between "vld", "vldr" and "vldm", and which one is the fastest? How can I perform the offset calculation only within NEON registers? Thank you.

4
I don't think your C code matches the description. The C is XORing multiple bytes of the same word, but the question says each byte is used to index the next. We can't optimize code if you can't show it clearly. – phkahler
Yes, you are right. I edited it. It is always another word. – HectorLector
NEON may not be the way to optimize this. The code is SIMD in the normal ARM instruction set. You can use a 64K table (easily generated) and process 16 bits at a time, and you can also run 32 bits of EORs at once and fold the result. The algorithm of getting a random index makes this memory-bound, so the code doing the EOR won't significantly affect things. – artless noise
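The 64K-table idea from the comment above can be sketched as follows: two adjacent 256-entry tables are merged into one 65536-entry table indexed by 16 bits at once, halving the number of lookups. Names are hypothetical; note the memory cost is 512 KB per merged table for 64-bit entries:

```c
#include <stdint.h>
#include <stdlib.h>

/* Merge two 256-entry tables into one 65536-entry table so that one
   16-bit index replaces two 8-bit lookups. */
uint64_t *merge_tables(const uint64_t T0[256], const uint64_t T1[256]) {
    uint64_t *T01 = malloc(65536 * sizeof *T01);
    if (!T01) return NULL;
    for (unsigned v = 0; v < 65536; v++)
        T01[v] = T0[v & 0xFF] ^ T1[v >> 8];  /* low byte -> T0, high byte -> T1 */
    return T01;
}
```

Whether this wins depends on the cache: a 512 KB table no longer fits in the A9's L1, so the halved lookup count trades against more misses.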

4 Answers

3
votes

NEON isn't very capable of dealing with lookups larger than the VTBL instruction's limits (32 bytes, if I remember correctly).
How is the lookup table created in the first place? If it's just calculations, let NEON do the math instead of resorting to lookups. It will be much faster that way.

2
votes

Don't use:

vmov.u8 r12, d0[0]

Moving data from a NEON register to an ARM register is the worst thing you can do.

Maybe you should look at the VTBL instruction! What is your byte range, 0..255?
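For reference, VTBL with a single table register selects bytes from an 8-byte in-register table, and an out-of-range index yields 0. A scalar C model of that semantics (an illustration only, not a drop-in replacement for the 256-entry tables in the question):

```c
#include <stdint.h>

/* Scalar model of NEON VTBL with one 8-byte table register: each index
   byte selects one table byte; indices >= 8 produce 0. */
void vtbl1_model(const uint8_t table[8], const uint8_t idx[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (idx[i] < 8) ? table[idx[i]] : 0;
}
```

This is why VTBL only helps if the lookup tables can be shrunk (or split) to fit in a few NEON registers.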

1
votes

Maybe you can try:

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d0}, [r3]

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d1}, [r3]
veor     d0, d0, d1         // d0 = d0 ^ d1

ldrb     r12, [r0], #1
add      r3, r2, r12, asl #3
vld1.64  {d1}, [r3]
veor     d0, d0, d1         // d0 = d0 ^ d1

...

This will not be the best solution. After that, you can increase performance by reordering instructions so that the loads and the EORs overlap.

0
votes

Try it with NEON "intrinsics". Basically, they're C functions that compile down to NEON instructions. The compiler still gets to do all the instruction scheduling, and you get the other boring stuff (moving data about) for free.

It doesn't always work perfectly, but it might be better than trying to hand code it.

Look for arm_neon.h.
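A sketch of how the XOR-folding part could look with intrinsics. The table lookups still go through normal loads; `vld1q_u64` and `veorq_u64` handle the folding two values at a time on NEON-capable builds, and the `#else` branch is a plain scalar fallback so the sketch stays portable:

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* XOR-fold an array of 64-bit table values into one result. */
uint64_t xor_fold(const uint64_t *vals, int n) {
#if defined(__ARM_NEON)
    uint64x2_t acc = vdupq_n_u64(0);
    int i = 0;
    for (; i + 2 <= n; i += 2)
        acc = veorq_u64(acc, vld1q_u64(&vals[i]));   /* two 64-bit XORs at once */
    uint64_t r = vgetq_lane_u64(acc, 0) ^ vgetq_lane_u64(acc, 1);
    for (; i < n; i++)
        r ^= vals[i];                                /* leftover element, if any */
    return r;
#else
    uint64_t r = 0;
    for (int i = 0; i < n; i++)
        r ^= vals[i];
    return r;
#endif
}
```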