3
votes

Can someone tell me a fast function to count the number of white pixels in a binary image. I need it for iOS app dev. I am working directly on the memory of the image defined as

  bool *imageData = (bool *) malloc(noOfPixels * sizeof(bool));

I am implementing the function

             int whiteCount = 0;
             for (int q=i; q<i+windowHeight; q++)
             {
                 for (int w=j; w<j+windowWidth; w++)
                 { 
                     if (imageData[q*W + w] == 1)
                         whiteCount++;
                 }
             }

This is obviously the slowest function possible. I heard that ARM Neon intrinsics on the iOS can be used to make several operations in 1 cycle. Maybe thats the way to go ??

The problem is that I am not very familiar and don't have enough time to learn assembly language at the moment. So it would be great if anyone can post a Neon intrinsics code for the problem mentioned above or any other fast implementation in C/C++.

The only code in neon intrinsics that I am able to find online is the code for rgb to gray http://computer-vision-talks.com/2011/02/a-very-fast-bgra-to-grayscale-conversion-on-iphone/

2
I'll take a look at this, but what is sizeof(bool) ?Paul R
Also, what are the possible values in imageData[] ? Is it just 0 or 1, or can there be other non-zero values ?Paul R

2 Answers

3
votes

Firstly you can speed up the original code a little by factoring out the multiply and getting rid of the branch:

 int whiteCount = 0;
 for (int q = i; q < i + windowHeight; q++)
 {
     const bool * const row = &imageData[q * W];

     for (int w = j; w < j + windowWidth; w++)
     { 
         whiteCount += row[w];
     }
 }

(This assumes that imageData[] is truly binary, i.e. each element can only ever be 0 or 1.)

Here is a simple NEON implementation:

#include <arm_neon.h>

// ...

int i, w;
int whiteCount = 0;
uint32x4_t v_count = { 0 };

for (q = i; q < i + windowHeight; q++)
{
    const bool * const row = &imageData[q * W];

    uint16x8_t vrow_count = { 0 };

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[j]);           // load 16 x 8 bit pixels
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += row[j];
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_s32(v_count, 0);
whiteCount += vgetq_lane_s32(v_count, 1);
whiteCount += vgetq_lane_s32(v_count, 2);
whiteCount += vgetq_lane_s32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] is truly binary, imageWidth <= 2^19, and sizeof(bool) == 1.)


Updated version for unsigned char and values of 255 for white, 0 for black:

#include <arm_neon.h>

// ...

int i, w;
int whiteCount = 0;
const uint8x16_t v_mask = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
uint32x4_t v_count = { 0 };

for (q = i; q < i + windowHeight; q++)
{
    const uint8_t * const row = &imageData[q * W];

    uint16x8_t vrow_count = { 0 };

    for (w = j; w <= j + windowWidth - 16; w += 16) // SIMD loop
    {
        uint8x16_t v = vld1q_u8(&row[j]);           // load 16 x 8 bit pixels
        v = vandq_u8(v, v_mask);                    // mask out all but LS bit
        vrow_count = vpadalq_u8(vrow_count, v);     // accumulate 16 bit row counts
    }
    for ( ; w < j + windowWidth; ++w)               // scalar clean up loop
    {
        whiteCount += (row[j] == 255);
    }
    v_count = vpadalq_u16(v_count, vrow_count);     // update 32 bit image counts
}                                                   // from 16 bit row counts
// add 4 x 32 bit partial counts from SIMD loop to scalar total
whiteCount += vgetq_lane_s32(v_count, 0);
whiteCount += vgetq_lane_s32(v_count, 1);
whiteCount += vgetq_lane_s32(v_count, 2);
whiteCount += vgetq_lane_s32(v_count, 3);
// total is now in whiteCount

(This assumes that imageData[] is has values of 255 for white and 0 for black, and imageWidth <= 2^19.)


Note that all the above code is untested and may need some further work.

0
votes

http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

Section 6.55.3.6

The vectorized algorithm will do the comparisons and put them in a structure for you, but you'd still need to go through each element of the structure and determine if it's a zero or not.

How fast does that loop currently run and how fast do you need it to run? Also remember that NEON will work in the same registers as the floating point unit, so using NEON here may force an FPU context switch.