2
votes

I'm trying to rotate an image counter-clockwise by 90 degrees, and then flip it horizontally.

My first approach was to just use OpenCV:

cv::transpose(in, tmp); // transpose around top left
cv::flip(tmp, out, -1); // flip on both axes

For performance, I'm trying to merge the two functions into one.

My code:

void ccw90_hflip_640x480(const cv::Mat& img, cv::Mat& out)
{
    assert(img.cols == 640 && img.rows == 480);
    assert(out.cols == 480 && out.rows == 640);

    uint32_t* imgData = (uint32_t*)img.data;
    uint32_t* outData = (uint32_t*)out.data;

    uint32_t *pRow = imgData;
    uint32_t *pEnd = imgData + (640 * 480);
    uint32_t *dstCol = outData + (480 * 640) - 1;

    for( ; pRow != pEnd; pRow += 640, dstCol -= 1)
    {
        for(uint32_t *ptr = pRow, *end = pRow + 640, *dst = dstCol;
            ptr != end;
            ++ptr, dst -= 480)
        {
            *dst = *ptr;
        }
    }
}
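
For checking any merged version against the two OpenCV calls, it can help to write the combined index mapping down once in a size-generic form. The following is only a sketch: `transpose_flip_both_ref` is a hypothetical helper name, and it assumes tightly packed 32-bit pixels with no row padding. A pixel at input row `y`, column `x` ends up at output row `w-1-x`, column `h-1-y`, which is the same mapping the fixed-size loop above implements for 640x480.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of the original post).
// `in` holds h rows of w pixels; the result holds w rows of h pixels.
// Assumes tightly packed 32-bit pixels (no row padding).
std::vector<uint32_t> transpose_flip_both_ref(const std::vector<uint32_t>& in,
                                              std::size_t w, std::size_t h)
{
    assert(in.size() == w * h);
    std::vector<uint32_t> out(w * h);
    for (std::size_t y = 0; y < h; ++y)
    {
        for (std::size_t x = 0; x < w; ++x)
        {
            // in(y, x) lands at out(w-1-x, h-1-y); output rows are h pixels wide.
            out[(w - 1 - x) * h + (h - 1 - y)] = in[y * w + x];
        }
    }
    return out;
}
```

Comparing the output of an optimized routine against this helper on a few small inputs is a cheap way to catch off-by-one errors in the pointer arithmetic.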

I thought the above would be faster, but it's not. I can't think of any reason it wouldn't be faster, besides OpenCV possibly using NEON.

I found this article/presentation: http://shervinemami.info/NEON_RotateBGRX.swf

The transposition and flipping are blurred together in a way that makes it very hard to modify so that it rotates the other way and flips around the horizontal axis, as I need it to. The article is very old, so I'm hoping there is a more straightforward way of doing what I need.

So what's the easiest way to transpose a 4x4 matrix of uint32 using ARM NEON?

Indeed, I've traveled to the future, where I confirmed that my arm neon implementation is faster. Unfortunately, before I could write down the algorithm, my time machine malfunctioned, sending me to an alternate universe, where everyone uses linked lists instead of vectors. - CuriousGeorge
See here for solutions (scroll down a bit for code that actually works). - Paul R
@PaulR Thanks for the link. I've got transposition working as per the article, but it'll take some time to get the full algorithm pieced together and do some measurements. - CuriousGeorge
@PaulR I ended up using the first method from the article you recommended (vtrn, vtrn, vswp, vswp). The de-interleave loading discussed there was a no-go because of the element size/count I'm working with. After updating the above algorithm to process 4x4 pixels at a time using the aforementioned transposition instructions (and flipping, as discussed in the *.swf file linked above), I was able to decrease the run time to ~30% of what it was. This makes your response the most correct, so if you post an answer, I will accept. Thanks. - CuriousGeorge
Glad to know it worked out - feel free to write this up as a self-answer for the benefit of future readers (I'm on a mobile device just now so not in a great position to write up a decent answer). - Paul R

2 Answers

2
votes

The following code is equivalent to the OpenCV calls in the original post, but performs several times faster (at least on my device).

Using NEON did indeed increase performance significantly. Since the transposition happens in registers inside the CPU, memory access can be streamlined to read and write pixels in sets of four, instead of one at a time, as discussed in the comments.

void ccw90_hflip_640x480_neon(const cv::Mat& img, cv::Mat& out)
{
    assert(img.cols == 640 && img.rows == 480);
    assert(out.cols == 480 && out.rows == 640);

    uint32_t *pRow = (uint32_t*)img.data;
    uint32_t *pEnd = (uint32_t*)img.data + (640 * 480);
    uint32_t *dstCol = (uint32_t*)out.data + (480 * 640) - (480 * 3) - 4;

    for( ; pRow != pEnd; pRow += 640 * 4, dstCol -= 4)
    {
        for(uint32_t *ptr = pRow, *end = pRow + 640, *dst = dstCol;
            ptr != end;
            ptr += 4, dst -= 480 * 4)
        {
            uint32_t* in0 = ptr;
            uint32_t* in1 = in0 + 640;
            uint32_t* in2 = in1 + 640;
            uint32_t* in3 = in2 + 640;

            uint32_t* out0 = dst;
            uint32_t* out1 = out0 + 480;
            uint32_t* out2 = out1 + 480;
            uint32_t* out3 = out2 + 480;

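            asm("vld1.32 {d0, d1}, [%[in0]]    \n" // q0 = input row 0
                "vld1.32 {d2, d3}, [%[in1]]    \n" // q1 = input row 1
                "vld1.32 {d4, d5}, [%[in2]]    \n" // q2 = input row 2
                "vld1.32 {d6, d7}, [%[in3]]    \n" // q3 = input row 3
                "vtrn.32 q0, q1                \n" // transpose the 2x2 blocks of rows 0/1
                "vtrn.32 q2, q3                \n" // transpose the 2x2 blocks of rows 2/3
                "vswp d1, d4                   \n" // swap the off-diagonal 2x2 blocks;
                "vswp d3, d6                   \n" // q0..q3 now hold input columns 0..3
                "vrev64.32 q0, q0              \n" // reverse each q register, step 1:
                "vrev64.32 q1, q1              \n" // swap the two 32-bit lanes inside
                "vrev64.32 q2, q2              \n" // each 64-bit half
                "vrev64.32 q3, q3              \n"
                "vswp d0, d1                   \n" // step 2: swap the two 64-bit halves
                "vswp d2, d3                   \n"
                "vswp d4, d5                   \n"
                "vswp d6, d7                   \n"
                "vst1.32 {d6, d7}, [%[out0]]   \n" // store in reverse order: column 3 first
                "vst1.32 {d4, d5}, [%[out1]]   \n"
                "vst1.32 {d2, d3}, [%[out2]]   \n"
                "vst1.32 {d0, d1}, [%[out3]]   \n"
                :
                : [out0] "r" (out0), [out1] "r" (out1), [out2] "r" (out2), [out3] "r" (out3),
                    [in0] "r" (in0), [in1] "r" (in1), [in2] "r" (in2), [in3] "r" (in3)
                // "memory" tells the compiler the asm reads and writes memory via the pointers
                : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "memory"
                );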
        }
    }
}
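
The 4x4 step inside the inner loop can also be written with NEON intrinsics rather than inline assembly, which leaves register allocation to the compiler. The following is only a sketch under stated assumptions: `block4x4_transpose_flip` and `reverse_lanes` are hypothetical names, strides are given in 32-bit elements, and the scalar branch is a fallback added here so the snippet also builds on targets without NEON. It is not the code that was measured above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>

// Reverse the four 32-bit lanes of a q register (vrev64 + half swap).
static inline uint32x4_t reverse_lanes(uint32x4_t v)
{
    v = vrev64q_u32(v);                                       // swap lanes inside each half
    return vcombine_u32(vget_high_u32(v), vget_low_u32(v));   // swap the halves
}
#endif

// Hypothetical helper: reads a 4x4 block of 32-bit pixels and writes it
// transposed and flipped on both axes. Strides are in elements, not bytes.
void block4x4_transpose_flip(const uint32_t* in, std::size_t inStride,
                             uint32_t* out, std::size_t outStride)
{
    assert(in != nullptr && out != nullptr);
#if defined(__ARM_NEON)
    uint32x4_t r0 = vld1q_u32(in);
    uint32x4_t r1 = vld1q_u32(in + inStride);
    uint32x4_t r2 = vld1q_u32(in + 2 * inStride);
    uint32x4_t r3 = vld1q_u32(in + 3 * inStride);

    uint32x4x2_t t01 = vtrnq_u32(r0, r1);   // {a0 b0 a2 b2}, {a1 b1 a3 b3}
    uint32x4x2_t t23 = vtrnq_u32(r2, r3);   // {c0 d0 c2 d2}, {c1 d1 c3 d3}

    // Recombine halves so that col0..col3 hold the input columns.
    uint32x4_t col0 = vcombine_u32(vget_low_u32(t01.val[0]),  vget_low_u32(t23.val[0]));
    uint32x4_t col1 = vcombine_u32(vget_low_u32(t01.val[1]),  vget_low_u32(t23.val[1]));
    uint32x4_t col2 = vcombine_u32(vget_high_u32(t01.val[0]), vget_high_u32(t23.val[0]));
    uint32x4_t col3 = vcombine_u32(vget_high_u32(t01.val[1]), vget_high_u32(t23.val[1]));

    // Store the reversed columns in reverse order: column 3 becomes output row 0.
    vst1q_u32(out,                 reverse_lanes(col3));
    vst1q_u32(out + outStride,     reverse_lanes(col2));
    vst1q_u32(out + 2 * outStride, reverse_lanes(col1));
    vst1q_u32(out + 3 * outStride, reverse_lanes(col0));
#else
    // Scalar fallback with the same result: out(r, c) = in(3 - c, 3 - r).
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[r * outStride + c] = in[(3 - c) * inStride + (3 - r)];
#endif
}
```

In the loop above it would be called as `block4x4_transpose_flip(ptr, 640, dst, 480)`. Whether it matches the speed of the hand-written assembly depends on the compiler and would need to be measured.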

1
vote

NEON will not help significantly (see the note at the end). Your code is simply moving data, and NEON cannot make the underlying memory significantly faster. See this article; using the PLD (preload) instruction will also help. I would suggest that you process dst in order and jump around with ptr. The cache will pre-fill the reads through ptr, and the sequential writes through dst will fill whole cache lines.

Here is an alternate form of traversing memory (variable names may not make sense):

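// Note: as written this is a plain transpose; it only illustrates the access order.
uint32_t *pRow = imgData;        // steps along the first input row, one column per pass
uint32_t *pEnd = imgData + 640;
uint32_t *dstCol = outData;

for( ; pRow != pEnd; pRow++)
{
    // Read down one input column (stride 640) and write one output row in order.
    for(uint32_t *ptr = pRow, *dst = dstCol, *end = dst + 480;
        dst != end;
        ptr += 640, dst++)
    {
        *dst = *ptr;
    }
    // could flush `dstCol` here as it is complete or hope the system clues in.
    dstCol += 480;
}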

The idea is to fill dst in order and jump around when reading imgData. All modern memory fills more efficiently if you write to it in order. The cache and synchronous DRAM usually fill several 32-bit words at a time. Knowing the L1 cache line size, we can unroll the inner loop: a line is either 32 or 64 bytes, which holds 8 or 16 32-bit pixels. The fill will be a similar amount, so you could reduce the transpose to cacheable blocks and process them one at a time. Think of the 640x480 image as being composed of 8x8 pixel tiles (8 pixels being the minimum L1 cache line size) and process each tile in turn.
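
The tiling idea can be sketched in plain C++ as follows. This is an illustration only: `transpose_flip_both_tiled` and `TILE` are hypothetical names, the tile size of 8 assumes a 32-byte cache line, and the mapping used is the one from the question (transpose plus flip on both axes) rather than the plain transpose shown above. Within each tile, the output is written in order while the input is read with a stride.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 8 pixels of 32 bits fill one 32-byte L1 cache line.
constexpr std::size_t TILE = 8;

// Hypothetical helper: `in` has h rows of w pixels, `out` has w rows of h pixels.
// Both dimensions must be multiples of TILE (true for 640x480).
void transpose_flip_both_tiled(const uint32_t* in, uint32_t* out,
                               std::size_t w, std::size_t h)
{
    assert(w % TILE == 0 && h % TILE == 0);
    for (std::size_t r0 = 0; r0 < w; r0 += TILE)          // output tile row
    {
        for (std::size_t c0 = 0; c0 < h; c0 += TILE)      // output tile column
        {
            for (std::size_t r = r0; r < r0 + TILE; ++r)
            {
                uint32_t* dst = out + r * h + c0;         // written sequentially
                for (std::size_t c = c0; c < c0 + TILE; ++c)
                {
                    // out(r, c) = in(h-1-c, w-1-r): strided reads stay inside one tile
                    *dst++ = in[(h - 1 - c) * w + (w - 1 - r)];
                }
            }
        }
    }
}
```

For the image in the question this would be called with `w = 640` and `h = 480`. Whether 8 or 16 is the better tile size depends on the cache line size of the target CPU, so it is worth measuring both.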

After you do this, the NEON instructions may gain some percentage. However, optimizing the load/store unit (common to all CPU units) should be the first step.

NOTE: NEON is SIMD (single instruction, multiple data), and it excels at number crunching on pixels, giving a computational boost by processing several at a time. It does have some instructions that will optimize memory traversal, but the underlying memory is the same for the core CPU units and the SIMD/NEON units. It is possible NEON will give a boost, but I think it is futile until you have optimized your access order for your memory system.