How can I make this loop run faster?

Question

I'm using this code to find the highest temperature pixel in a thermal image and the coordinates of the pixel.

void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
    int temp = 0;

    for (int i = sz; i > 0; i--)
    {
        if (returnPixel->temperature < *image)
        {
            returnPixel->temperature = *image;
            temp = i;
        }
        image++;
    }

    returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
    returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}

With an image size of 640x480, this function takes around 35ms to run, which is too slow for my needs (ideally under 10ms).

This is executing on an ARM A9 processor running Linux.

The compiler I'm using is ARM v8 32-Bit Linux gcc compiler.

I'm compiling with -O3 and the following options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.

This is the output from the compiler:

000127f4 <_findMax>:
    for(int i = sz; i > 0; i--)
   127f4:   e3510000    cmp r1, #0
{
   127f8:   e52de004    push    {lr}        ; (str lr, [sp, #-4]!)
    for(int i = sz; i > 0; i--)
   127fc:   da000014    ble 12854 <_findMax+0x60>
   12800:   e1d2c0b0    ldrh    ip, [r2]
   12804:   e2400002    sub r0, r0, #2
    int temp = 0;
   12808:   e3a0e000    mov lr, #0
        if(returnPixel->temperature < *image)
   1280c:   e1f030b2    ldrh    r3, [r0, #2]!
   12810:   e153000c    cmp r3, ip
            returnPixel->temperature = *image;
   12814:   81a0c003    movhi   ip, r3
   12818:   81a0e001    movhi   lr, r1
   1281c:   81c230b0    strhhi  r3, [r2]
    for(int i = sz; i > 0; i--)
   12820:   e2511001    subs    r1, r1, #1
   12824:   1afffff8    bne 1280c <_findMax+0x18>
   12828:   e30c3ccd    movw    r3, #52429  ; 0xcccd
   1282c:   e34c3ccc    movt    r3, #52428  ; 0xcccc
   12830:   e0831e93    umull   r1, r3, r3, lr
   12834:   e1a034a3    lsr r3, r3, #9
   12838:   e0831103    add r1, r3, r3, lsl #2
   1283c:   e6ff3073    uxth    r3, r3
   12840:   e04ee381    sub lr, lr, r1, lsl #7
   12844:   e6ffe07e    uxth    lr, lr
    returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
   12848:   e1c2e0b4    strh    lr, [r2, #4]
    returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
   1284c:   e1c230b6    strh    r3, [r2, #6]
}
   12850:   e49df004    pop {pc}        ; (ldr pc, [sp], #4)
    for(int i = sz; i > 0; i--)
   12854:   e3a03000    mov r3, #0
   12858:   e1a0e003    mov lr, r3
   1285c:   eafffff9    b   12848 <_findMax+0x54>

For clarity after comments:

Each pixel is an unsigned 16-bit integer. image[0] is the pixel with coordinates 0,0, and the last element of the array has the coordinates 639,479.
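
As a small illustration of that layout, the mapping from array index to coordinates might look like the sketch below. `index_to_xy` and `IMAGE_VERTICAL_SIZE` are names invented here, and the value 640 is taken from the stated image width; only `IMAGE_HORIZONTAL_SIZE` appears in the code above. Note that the index here counts up from 0, whereas the loop counter `i` in `_findMax` counts down from `sz`.

```c
#include <assert.h>
#include <stdint.h>

#define IMAGE_HORIZONTAL_SIZE 640 /* assumed from the 640x480 image size */
#define IMAGE_VERTICAL_SIZE 480   /* hypothetical name, not in the original code */

/* Hypothetical helper: converts a forward (0-based) array index
   into pixel coordinates for a row-major image. */
void index_to_xy(int index, uint16_t *x, uint16_t *y)
{
    assert(index >= 0 && index < IMAGE_HORIZONTAL_SIZE * IMAGE_VERTICAL_SIZE);
    *x = (uint16_t)(index % IMAGE_HORIZONTAL_SIZE); /* column */
    *y = (uint16_t)(index / IMAGE_HORIZONTAL_SIZE); /* row */
}
```

With this mapping, index 0 gives 0,0 and index 640 * 480 - 1 gives 639,479, matching the description.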

You could check every second line and every second column, which would make it roughly 4 times faster. You lose precision, but it might be good enough for your needs. - Jabberwocky
a) Unroll the for loop a little, e.g. process several pixels per iteration. See Duff's device: en.wikipedia.org/wiki/Duff%27s_device. b) Use some gradient path algorithm: en.wikipedia.org/wiki/Gradient_descent - avans
How are you reading the pixels? Finding the hottest at that time may be trivial. - Andrew Henle
@pqans I added the option -funroll-loops, which got it down to 13ms. - James Swift
Yep, so that algorithm isn't useful unless you can find a way to split the image into several "hot spot squares". If you can do that, then gradient descent is much faster than "brute force", which is the current algorithm. - Lundin

Brendan · Accepted Answer · 2020-11-05T10:51:15

"This is executing on an ARM A9 processor running Linux."

ARM Cortex-A9 supports Neon.

With this in mind, the goal should be to load 8 values (128 bits of pixel data) into a register and compare them with the current maximums for each of the 8 places to get a mask. Then use the mask and its inverse to zero out the "too small" old maximums and the "too small" new values, and OR the two results to merge the new higher values into the current maximums for each of the 8 places.

Once that has been done for all pixels (using a loop), you'd want to find the highest value among the current maximums for each of the 8 places.
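
The compare, mask and merge steps described above can be sketched in portable C, which makes the mask arithmetic easy to check without Neon hardware. This is only an illustration under stated assumptions: the function name `max_temperature_8lane` and the `LANES` constant are made up here, the pixel count is assumed to be a non-zero multiple of 8, and whether a given compiler turns the inner loop into vector instructions is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8 /* 8 x 16-bit pixels = 128 bits, one register's worth */

/* Hypothetical helper: returns the highest pixel value, not its location.
   Assumes count is a non-zero multiple of LANES (640 * 480 is). */
uint16_t max_temperature_8lane(const uint16_t *image, size_t count)
{
    uint16_t lane_max[LANES] = {0}; /* current maximum for each of the 8 places */

    assert(count > 0 && count % LANES == 0);

    for (size_t i = 0; i < count; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            uint16_t candidate = image[i + lane];
            /* All ones where the new value is larger, otherwise all zeros. */
            uint16_t mask = (uint16_t)-(candidate > lane_max[lane]);
            /* Mask out the smaller of the two, then OR the results together. */
            lane_max[lane] = (uint16_t)((candidate & mask) |
                                        (lane_max[lane] & (uint16_t)~mask));
        }
    }

    /* Final step: reduce the 8 per-lane maximums to a single value. */
    uint16_t best = lane_max[0];
    for (int lane = 1; lane < LANES; lane++) {
        if (lane_max[lane] > best)
            best = lane_max[lane];
    }
    return best;
}
```

For the image in the question this would be called as `max_temperature_8lane(image, 640 * 480)`.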

However, to find the location of the hottest pixel (rather than just how hot it is), you'd want to split the image into tiles (e.g. maybe 8 pixels wide and 8 pixels tall). This allows you to find the max. temperature within each tile (using Neon), then find the pixel within the hottest tile. Note that for huge images this lends itself to a "multi-layer" approach: e.g. create a smaller image containing the maximum from each tile in the original image, then do the same thing again to create an even smaller image containing the maximum from each "group of tiles", then ...
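
A rough single-layer sketch of the tile idea, in plain C, might look like this. The names `hot_pixel`, `tile_max`, `find_hottest_tiled` and `TILE` are hypothetical, the result struct is a stand-in because the question's `sPixelData` is not shown in full, and the width and height are assumed to be non-zero multiples of 8. The per-tile maximum is the part that would be handed to Neon; here it is an ordinary loop.

```c
#include <assert.h>
#include <stdint.h>

#define TILE 8 /* tile edge length in pixels, as suggested in the text */

/* Hypothetical result type; a stand-in for the question's sPixelData. */
typedef struct {
    uint16_t temperature;
    uint16_t x;
    uint16_t y;
} hot_pixel;

/* Highest value inside the TILE x TILE tile whose top-left pixel is (x0, y0). */
static uint16_t tile_max(const uint16_t *image, int width, int x0, int y0)
{
    uint16_t best = 0;
    for (int y = y0; y < y0 + TILE; y++) {
        const uint16_t *row = image + y * width + x0;
        /* 8 contiguous pixels per row: the natural unit for one Neon load. */
        for (int x = 0; x < TILE; x++)
            best = row[x] > best ? row[x] : best;
    }
    return best;
}

/* Assumes width and height are non-zero multiples of TILE (640 and 480 are). */
hot_pixel find_hottest_tiled(const uint16_t *image, int width, int height)
{
    hot_pixel result = {0, 0, 0};
    int best_x0 = 0, best_y0 = 0;

    assert(width > 0 && height > 0 && width % TILE == 0 && height % TILE == 0);

    /* Pass 1: find the tile that contains the highest value. */
    for (int y0 = 0; y0 < height; y0 += TILE) {
        for (int x0 = 0; x0 < width; x0 += TILE) {
            uint16_t m = tile_max(image, width, x0, y0);
            if (m > result.temperature) {
                result.temperature = m;
                best_x0 = x0;
                best_y0 = y0;
            }
        }
    }

    /* Pass 2: scan only the 64 pixels of that tile for the exact location. */
    for (int y = best_y0; y < best_y0 + TILE; y++) {
        for (int x = best_x0; x < best_x0 + TILE; x++) {
            if (image[y * width + x] == result.temperature) {
                result.x = (uint16_t)x;
                result.y = (uint16_t)y;
                return result;
            }
        }
    }
    return result;
}
```

The location search touches only one tile, so the branchy part of the work is limited to 64 pixels; the rest is a plain maximum that does not need to track an index.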

Making this work in plain C means trying to convince the compiler to auto-vectorize. The alternatives are to use compiler intrinsics or inline assembly. In any of these cases, using Neon to do 8 pixels in parallel (without any branches) could/should improve performance significantly (how much depends on RAM bandwidth).
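
As one possible illustration of the intrinsics route, the sketch below uses `vld1q_u16`, `vcgtq_u16`, `vbslq_u16` and `vst1q_u16` from `arm_neon.h`. The function name `max_temperature_neon` is made up, the pixel count is assumed to be a non-zero multiple of 8, and a scalar fallback is included so the code still builds on targets without Neon. It has not been benchmarked, so no speedup is claimed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: highest pixel value only, not its location.
   Assumes count is a non-zero multiple of 8. */
uint16_t max_temperature_neon(const uint16_t *image, size_t count)
{
    uint16_t lanes[8] = {0};

    assert(count > 0 && count % 8 == 0);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t current = vdupq_n_u16(0);
    for (size_t i = 0; i < count; i += 8) {
        uint16x8_t pixels = vld1q_u16(image + i);     /* load 8 pixels */
        uint16x8_t mask = vcgtq_u16(pixels, current); /* all ones where new > old */
        current = vbslq_u16(mask, pixels, current);   /* select new or old per lane */
    }
    vst1q_u16(lanes, current);
#else
    /* Scalar fallback so the sketch still builds on targets without Neon. */
    for (size_t i = 0; i < count; i += 8)
        for (int lane = 0; lane < 8; lane++)
            if (image[i + lane] > lanes[lane])
                lanes[lane] = image[i + lane];
#endif

    /* Reduce the 8 per-lane maximums to a single value. */
    uint16_t best = lanes[0];
    for (int lane = 1; lane < 8; lane++)
        if (lanes[lane] > best)
            best = lanes[lane];
    return best;
}
```

Neon also provides `vmaxq_u16`, which performs the compare and select as a single operation; the explicit mask is kept here only to mirror the description above.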
