I have the following 32-bit NEON code that extracts a rectangular region from an image:

extractY8ImageARM(unsigned char *from, unsigned char *to, int left, int top, int width, int height, int stride)
from: pointer to the original image
to: pointer to the destination extracted image
left, top: position of the top-left corner of the extracted region in the original image
width, height: size of the extracted image
stride: width of the original image in bytes
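
For comparison, the behaviour described by these parameters can be sketched in plain C. This is only a reference model, not code from the post: the name extractY8ImageRef is hypothetical, and it assumes Y8 means one byte per pixel, so stride counts bytes per source row.

```c
#include <string.h>

/* Plain-C reference for the behaviour described above (hypothetical helper,
   not part of the original post). Assumes one byte per pixel, so stride is
   the number of bytes per row of the source image. */
void extractY8ImageRef(const unsigned char *from, unsigned char *to,
                       int left, int top, int width, int height, int stride)
{
    /* Skip to the top-left corner of the region: left + stride * top. */
    from += (size_t)top * (size_t)stride + (size_t)left;

    for (int y = 0; y < height; ++y) {
        memcpy(to, from, (size_t)width);  /* copy one row of the region */
        from += stride;                   /* next row in the source */
        to += width;                      /* destination rows are packed */
    }
}
```

Unlike the NEON loop, this model accepts any width, which makes it a convenient baseline when checking the assembly output.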

Here is the assembly code:

.text
.arch armv7-a
.fpu neon
.type extractY8ImageARM, STT_FUNC
.global extractY8ImageARM

extractY8ImageARM:
from    .req r0
to  .req r1
left    .req r2
top .req r3
width   .req r4
height  .req r5
stride  .req r6
tmp .req r7

    push {r0-r7, lr}

//Let's get back the arguments
    ldr width, [sp, #(9 * 4)]
    ldr height, [sp, #(10 * 4)]
    ldr stride, [sp, #(11 * 4)]

//Update the from pointer. Advance left + stride * top
    add from, from, left
    mul tmp, top, stride
    add from, from, tmp

.loopV:
//We will copy width
    mov tmp, width

.loopH:
//Read and store data
    pld [from]
    vld1.u8 { d0, d1, d2, d3 }, [from]!

    pld [to]
    vst1.u8 { d0, d1, d2, d3 }, [to]!

    subs tmp, tmp, #32
    bgt .loopH

//We advance the from pointer for the next line
    add from, from, stride
    sub from, from, width

    subs height, height, #1
    bgt .loopV


    pop {r0-r7, pc}

.unreq from
.unreq to
.unreq left
.unreq top
.unreq width
.unreq height
.unreq stride
.unreq tmp

I need to port it to 64-bit NEON. Can anyone help me with the translation? I have read this white paper http://malideveloper.arm.com/downloads/Porting%20to%20ARM%2064-bit.pdf so I understand the differences more or less.

My code is simple, and it would be a good example of how to pass arguments and load/store data in a 64-bit NEON assembly file. I prefer to avoid intrinsics.

See my answer here: stackoverflow.com/questions/28050300/… It really would be a good idea to use intrinsics. Your NEON code is not very optimized and would be portable to both ARM32 and ARM64 if you used intrinsics. - BitBank
If you insist on writing ARM64 assembly language, you can see my github project here for a complete example: github.com/bitbank2/gcc_perf - BitBank
Even with intrinsics, there are too many changes between neon-32bit to neon-64bit. - gregoiregentil
I got the ld1 / st1 instructions. Thanks. That's useful. I'm unsure how to translate the push/pop. - gregoiregentil
It would be interesting to understand why the 32-bit neon code is not optimized. This is just a loop with vld1 and vst1. What could you do more or differently? Note that the width needs to be divisible by 16, not 32. - gregoiregentil
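
The intrinsics route suggested in the comments could look roughly like the sketch below. It is not code from either commenter: the name extractY8ImageIntrin is hypothetical, it assumes width is a multiple of 16 (one 128-bit register per step), and it falls back to memcpy when NEON is unavailable so the same file builds on other hosts. vld1q_u8 and vst1q_u8 come from arm_neon.h and are available on both ARM32 and ARM64.

```c
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical intrinsics variant; assumes width is a multiple of 16. */
void extractY8ImageIntrin(const uint8_t *from, uint8_t *to,
                          int left, int top, int width, int height, int stride)
{
    /* Advance to the top-left corner of the region. */
    from += (size_t)top * (size_t)stride + (size_t)left;

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; x += 16) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
            vst1q_u8(to + x, vld1q_u8(from + x)); /* one 128-bit load and store */
#else
            memcpy(to + x, from + x, 16);         /* stand-in on non-NEON hosts */
#endif
        }
        from += stride;  /* next source row */
        to += width;     /* destination rows are packed */
    }
}
```

Copying 16 bytes per step matches the "divisible by 16" requirement mentioned above, at the cost of more loop iterations than a 32-byte or 64-byte step.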

1 Answer

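On AArch64 all seven arguments are passed in registers (x0 and x1 for the pointers, w2-w6 for the ints), and the function only touches caller-saved registers (x9, v0-v3), so the push/pop and the stack loads of the 32-bit version are not needed. The whole code looks like this: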

.text
.arch armv8-a
.type extractY8ImageARM, STT_FUNC
.global extractY8ImageARM

extractY8ImageARM:
from    .req x0
to  .req x1
left    .req x2
top .req x3
width   .req x4
height  .req x5
stride  .req x6
tmp .req x9

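//The int arguments arrive in w2-w6; the upper 32 bits of x2-x6 are
//unspecified, so sign-extend them before any 64-bit arithmetic
    sxtw left, w2
    sxtw top, w3
    sxtw width, w4
    sxtw height, w5
    sxtw stride, w6

//Update the from pointer. Advance left + stride * top
    add from, from, left
    mul tmp, top, stride
    add from, from, tmp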

.loopV:
    mov tmp, width

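.loopH:
//Each iteration copies 64 bytes (four 16-byte registers); the 32-bit
//version copied 32, so here width must be a multiple of 64
    ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [from], #64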

    st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [to], #64

    subs tmp, tmp, #64
    bgt .loopH

    add from, from, stride
    sub from, from, width

    subs height, height, #1
    bgt .loopV

    ret


.unreq from
.unreq to
.unreq left
.unreq top
.unreq width
.unreq height
.unreq stride
.unreq tmp
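
The inner loop above moves 64 bytes per iteration, whereas the 32-bit version moves 32. A small C model of the loop counter (a hypothetical helper, bytesCopiedPerRow, written only to illustrate the arithmetic) shows how many bytes one row actually transfers for a given width and block size:

```c
/* Models the .loopH counter: tmp starts at width and drops by `block`
   after each copy, looping while the result is still positive
   (subs + bgt). Returns the number of bytes copied for one row. */
int bytesCopiedPerRow(int width, int block)
{
    int copied = 0;
    int tmp = width;
    do {
        copied += block;   /* ld1/st1 always move one full block */
        tmp -= block;      /* subs tmp, tmp, #block */
    } while (tmp > 0);     /* bgt .loopH */
    return copied;
}
```

With a 64-byte block, a width of 96 copies 128 bytes per row, so both pointers end up 32 bytes past where the next row should start. Under this model the routine is only exact when width is a multiple of the block size. If narrower multiples are required, loading fewer registers per iteration (for example two, with a post-increment and subs of #32) would be one way to restore the original step.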