arm neon transpose 4x4 uint32

Question

I'm trying to rotate an image counter-clockwise by 90 degrees, and then flip it horizontally.

My first approach was to just use OpenCV:

cv::transpose(in, tmp); // transpose around top left
cv::flip(tmp, out, -1); // flip on both axes

For performance, I'm trying to merge the two functions into one.

My code:

void ccw90_hflip_640x480(const cv::Mat& img, cv::Mat& out)
{
    assert(img.cols == 640 && img.rows == 480);
    assert(out.cols == 480 && out.rows == 640);

    uint32_t* imgData = (uint32_t*)img.data;
    uint32_t* outData = (uint32_t*)out.data;

    uint32_t *pRow = imgData;
    uint32_t *pEnd = imgData + (640 * 480);
    uint32_t *dstCol = outData + (480 * 640) - 1;

    for( ; pRow != pEnd; pRow += 640, dstCol -= 1)
    {
        for(uint32_t *ptr = pRow, *end = pRow + 640, *dst = dstCol;
            ptr != end;
            ++ptr, dst -= 480)
        {
            *dst = *ptr;
        }
    }
}

I thought the above would be faster, but it's not. I can't think of any reason it wouldn't be faster, besides OpenCV possibly using NEON.
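For checking any merged or vectorised version against the two OpenCV calls, the index mapping that the loop above implements can be written out directly. The sketch below is a plain scalar reference over a flat buffer. The function name `transposeFlipBothRef` and the `std::vector` interface are assumptions made for illustration and are not part of OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not an OpenCV function).
// Input is rows x cols, row-major. Output is cols x rows, row-major.
// Mapping: out[cols-1-c][rows-1-r] = in[r][c],
// i.e. transpose, then flip on both axes.
std::vector<uint32_t> transposeFlipBothRef(const std::vector<uint32_t>& in,
                                           std::size_t rows, std::size_t cols)
{
    assert(in.size() == rows * cols);
    std::vector<uint32_t> out(in.size());
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < cols; ++c)
        {
            out[(cols - 1 - c) * rows + (rows - 1 - r)] = in[r * cols + c];
        }
    }
    return out;
}
```

With `rows = 480` and `cols = 640` this produces the same destination index as the pointer arithmetic in the loop, so it can serve as a slow but simple baseline when testing a faster version.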

I found this article/presentation: http://shervinemami.info/NEON_RotateBGRX.swf

The transposition and flipping are blurred together in a way that makes it very hard to modify so that it rotates the other way and flips around the horizontal axis, like I need it to. The article is very old, so I'm hoping there is a more straightforward way of doing what I need.

So what's the easiest way to transpose a 4x4 matrix of uint32 using arm NEON?

Indeed, I've traveled to the future, where I confirmed that my arm neon implementation is faster. Unfortunately, before I could write down the algorithm, my time machine malfunctioned, sending me to an alternate universe, where everyone uses linked lists instead of vectors. — CuriousGeorge
See here for solutions (scroll down a bit for code that actually works). — Paul R
@PaulR Thanks for the link. I've got transposition working as per the article, but it'll take some time to get the full algorithm pieced together and do some measurements. — CuriousGeorge
@PaulR I ended up using the first method from the article you recommended (vtrn, vtrn, vswp, vswp). The de-interleaving load discussed there was a no-go because of the element size and count I'm working with. After updating the above algorithm to process 4x4 pixels at a time using those transposition instructions (with the flipping done as in the *.swf file linked above), I was able to decrease the run time to ~30% of what it was. This makes your response the most correct, so if you post an answer, I will accept. Thanks. — CuriousGeorge
Glad to know it worked out - feel free to write this up as a self-answer for the benefit of future readers (I'm on a mobile device just now so not in a great position to write up a decent answer). — Paul R

CuriousGeorge · Accepted Answer · 2016-04-30T19:25:41

The following code is equivalent to the OpenCV calls in the original post, but performs several times faster (at least on my device).

Using NEON did indeed increase performance significantly. Because each 4x4 block is transposed in registers, memory can be read and written four pixels at a time instead of one at a time, as discussed in the comments.

void ccw90_hflip_640x480_neon(const cv::Mat& img, cv::Mat& out)
{
    assert(img.cols == 640 && img.rows == 480);
    assert(out.cols == 480 && out.rows == 640);

    uint32_t *pRow = (uint32_t*)img.data;
    uint32_t *pEnd = (uint32_t*)img.data + (640 * 480);
    uint32_t *dstCol = (uint32_t*)out.data + (480 * 640) - (480 * 3) - 4;

    for( ; pRow != pEnd; pRow += 640 * 4, dstCol -= 4)
    {
        for(uint32_t *ptr = pRow, *end = pRow + 640, *dst = dstCol;
            ptr != end;
            ptr += 4, dst -= 480 * 4)
        {
            uint32_t* in0 = ptr;
            uint32_t* in1 = in0 + 640;
            uint32_t* in2 = in1 + 640;
            uint32_t* in3 = in2 + 640;

            uint32_t* out0 = dst;
            uint32_t* out1 = out0 + 480;
            uint32_t* out2 = out1 + 480;
            uint32_t* out3 = out2 + 480;

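            asm("vld1.32 {d0, d1}, [%[in0]]    \n"  // q0 = input row 0
                "vld1.32 {d2, d3}, [%[in1]]    \n"  // q1 = input row 1
                "vld1.32 {d4, d5}, [%[in2]]    \n"  // q2 = input row 2
                "vld1.32 {d6, d7}, [%[in3]]    \n"  // q3 = input row 3
                "vtrn.32 q0, q1                \n"  // transpose the 2x2 sub-blocks
                "vtrn.32 q2, q3                \n"
                "vswp d1, d4                   \n"  // swap the off-diagonal halves;
                "vswp d3, d6                   \n"  // q0..q3 now hold columns 0..3
                "vrev64.32 q0, q0              \n"  // reverse the two lanes inside
                "vrev64.32 q1, q1              \n"  // each 64-bit half
                "vrev64.32 q2, q2              \n"
                "vrev64.32 q3, q3              \n"
                "vswp d0, d1                   \n"  // swap the halves, completing
                "vswp d2, d3                   \n"  // the four-lane reversal
                "vswp d4, d5                   \n"
                "vswp d6, d7                   \n"
                "vst1.32 {d6, d7}, [%[out0]]   \n"  // store the columns in reverse order
                "vst1.32 {d4, d5}, [%[out1]]   \n"
                "vst1.32 {d2, d3}, [%[out2]]   \n"
                "vst1.32 {d0, d1}, [%[out3]]   \n"
                :
                : [out0] "r" (out0), [out1] "r" (out1), [out2] "r" (out2), [out3] "r" (out3),
                    [in0] "r" (in0), [in1] "r" (in1), [in2] "r" (in2), [in3] "r" (in3)
                // "memory" tells the compiler the block writes through the out pointers
                : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "memory"
                );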
        }
    }
}
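To make the register shuffling easier to follow, here is a portable lane-level model of the instruction sequence used in the asm block. It is only a sketch: the type `Q` and the helper names `vtrn32`, `swapHighLow`, `reverseLanes` and `transposeAndReverse` are made up for illustration and are not NEON intrinsics. Each `std::array` stands in for one q register with four 32-bit lanes.

```cpp
#include <array>
#include <cstdint>
#include <utility>

// Models one 128-bit q register as four 32-bit lanes (lane 0 first).
using Q = std::array<uint32_t, 4>;

// Models "vtrn.32 qa, qb": transposes each 2x2 sub-block,
// i.e. swaps a[1] with b[0] and a[3] with b[2].
void vtrn32(Q& a, Q& b)
{
    std::swap(a[1], b[0]);
    std::swap(a[3], b[2]);
}

// Models "vswp" between the high d register of a and the low d register of b
// (for example "vswp d1, d4" for a = q0, b = q2).
void swapHighLow(Q& a, Q& b)
{
    std::swap(a[2], b[0]);
    std::swap(a[3], b[1]);
}

// Models "vrev64.32 q, q" followed by a "vswp" of the register's own halves,
// which together reverse all four lanes.
void reverseLanes(Q& q)
{
    std::swap(q[0], q[1]);  // vrev64.32: reverse lanes in the low half
    std::swap(q[2], q[3]);  // vrev64.32: reverse lanes in the high half
    std::swap(q[0], q[2]);  // vswp: exchange the low and high halves
    std::swap(q[1], q[3]);
}

// Takes four input rows and returns four registers, where result[i]
// holds input column i with its lanes reversed.
std::array<Q, 4> transposeAndReverse(std::array<Q, 4> q)
{
    vtrn32(q[0], q[1]);
    vtrn32(q[2], q[3]);
    swapHighLow(q[0], q[2]);
    swapHighLow(q[1], q[3]);
    for (Q& r : q)
    {
        reverseLanes(r);
    }
    return q;
}
```

The asm block then stores the registers in the order q3, q2, q1, q0 to out0 through out3, which supplies the reversal along the other axis.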
