22
votes

BACKGROUND (skip this if you like)

Let me start by saying that I am no expert programmer. I am a young junior computer vision (CV) engineer, and I am fairly experienced in C++ programming, mainly because of extensive use of the great OpenCV 2 C++ API. Everything I've learned came from the need to execute projects, to solve problems and to meet deadlines, as is the reality in the industry.

Recently, we started developing CV software for embedded systems (ARM boards), and we do it using plain, optimized C++ code. However, it is a huge challenge to build a real-time CV system on this kind of architecture because of its limited resources compared to traditional computers.

That's when I found out about NEON. I've read a bunch of articles about it, but it is a fairly recent topic, so there isn't much information available, and the more I read, the more confused I get.

The QUESTION

I'm looking to optimize C++ code (mainly some for loops) using NEON's capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in a C++ environment? The main source of my confusion is that almost all the code snippets I see are in assembly, for which I have absolutely no background and which I can't possibly afford to learn at this point. I use the Eclipse IDE on Gentoo Linux to write C++ code.

UPDATE

After reading the answers I did some tests with the software. I compiled my project with the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon 

Keep in mind that this project includes extensive libraries such as openFrameworks, OpenCV and OpenNI, and everything was compiled with these flags. To compile for the ARM board we use a Linaro toolchain cross-compiler, and GCC's version is 4.8.3. Would you expect this to improve the performance of the project? Because we experienced no changes at all, which is rather weird considering all the answers I read here.

Another question: all the for loops have an apparent number of iterations, but many of them iterate through custom data types (structs or classes). Can GCC optimize these loops even though they iterate through custom data types?

7
Thanks for the answers, everyone. Although this question was (understandably) downvoted because it isn't really a question, these answers offer great insight to those who are starting to dig into this subject :) – Pedro Batista
If you were compiling with -O3, then -ftree-vectorize would already be implied, so I wouldn't expect a change in performance. Additionally, if somebody has already written the code around the bottlenecks near-optimally, I wouldn't expect to see any improvement. The only way to truly tell is to look at some performance data for the code, find the hotspots, look at how they are currently implemented and what they currently compile to, and then take a decision as to whether you can do better. – James Greenhalgh
Nope, I wasn't compiling with -O3. – Pedro Batista
But it doesn't make a difference. – Pedro Batista

7 Answers

15
votes

EDIT:

From your update, you may misunderstand what the NEON processor does. It is an SIMD (Single Instruction, Multiple Data) vector processor. That means that it is very good at applying an instruction (say "multiply by 4") to several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if your problem looks like one of those things, the NEON processor is going to be a huge help.
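For example, here is a minimal sketch (function and array names are illustrative, and it assumes the length is a multiple of 4) of the "add each element of these two lists" case written with NEON intrinsics:

#include <arm_neon.h>

// Add two float arrays elementwise, 4 lanes per iteration.
// Assumes n is a multiple of 4; names are illustrative.
void add_arrays(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));  // add lane-wise, store 4 results
    }
}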

To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a roundtrip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.

Now, for many very common kinds of problems, the compiler automatically can work all of this out. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's an SIMD processor for doing parallel math.

For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.
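As a small illustration of what "figure out the data format" means in practice (the type and array names here are purely illustrative), compare the same points stored two different ways, only one of which is friendly to vector loads:

// Array-of-structs: x, y and z of one point are adjacent, so the x values are
// strided in memory and awkward to load into a vector register.
struct PointAoS { float x, y, z; };
PointAoS points_aos[1024];

// Struct-of-arrays: all the x values are contiguous, so four of them can be
// loaded with a single vld1q_f32 and processed in parallel.
struct PointsSoA {
    float x[1024];
    float y[1024];
    float z[1024];
};
PointsSoA points_soa;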

The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.
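A rough sketch of what that looks like (the matrix, scale and offset here are illustrative, not taken from your project): express the work as whole-matrix operations and let OpenCV's optimized kernels do the per-pixel loop for you.

#include <opencv2/core/core.hpp>

// Illustrative only: 'depth' is assumed to be a CV_16U depth image, and
// scale/offset are assumed calibration constants.
cv::Mat depth_to_millimetres(const cv::Mat& depth, double scale, double offset)
{
    cv::Mat depth_f;
    depth.convertTo(depth_f, CV_32F);   // bulk 16-bit -> float conversion
    return depth_f * scale + offset;    // per-element multiply-add, done in bulk
}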


My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job at obvious cases, but does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)

That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.

If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), were enough to let me beat the compiler by 2-3x in my particular problems. On the other hand, they were still insufficient: when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)

One side note, my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.

9
votes

Here's a "mee too" with some blog posts from ARM. FIRST, start with the following to get the background information, including 32-bit ARM (ARMV7 and below), Aarch32 (ARMv8 32-bit ARM) and Aarch64 (ARMv8 64-bit ARM):

Second, check out the Coding for NEON series. It's a nice introduction with pictures, so things like interleaved loads make sense at a glance.
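For instance, the interleaved-load idea from that series boils down to something like this (a sketch with illustrative names that processes exactly 16 pixels):

#include <arm_neon.h>
#include <stdint.h>

// vld3q_u8 reads 48 bytes of packed R,G,B,R,G,B,... and de-interleaves them
// into three registers holding 16 reds, 16 greens and 16 blues.
void split_rgb16(const uint8_t* rgb, uint8_t* r, uint8_t* g, uint8_t* b)
{
    uint8x16x3_t px = vld3q_u8(rgb);  // one de-interleaving load of 16 pixels
    vst1q_u8(r, px.val[0]);
    vst1q_u8(g, px.val[1]);
    vst1q_u8(b, px.val[2]);
}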

I also went on Amazon looking for books on ARM assembly with a treatment of NEON. I could only find two, and neither book's treatment of NEON was impressive. Both reduced it to a single chapter with the obligatory matrix example.


I believe ARM intrinsics are a very good idea. The intrinsics allow you to write code for the GCC, Clang and Visual C/C++ compilers. We have one code base that works for ARM Linux distros (like Linaro), some iOS devices (using -arch armv7) and Microsoft gadgets (like Windows Phone and Windows Store apps).
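A sketch of the kind of guard that keeps such a code base portable (HAVE_NEON is just an illustrative name; the other macros are the ones the compilers predefine):

// __ARM_NEON__ (older GCC), __ARM_NEON (ACLE) and _M_ARM (Visual C++) are the
// usual predefined macros; HAVE_NEON is our own illustrative switch.
#if defined(__ARM_NEON__) || defined(__ARM_NEON) || defined(_M_ARM)
# include <arm_neon.h>
# define HAVE_NEON 1
#else
# define HAVE_NEON 0
#endif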

5
votes

If you have access to a reasonably modern GCC (GCC 4.8 and upwards) I would recommend giving intrinsics a go. The NEON intrinsics are a set of functions that the compiler knows about, which can be used from C or C++ programs to generate NEON/Advanced SIMD instructions. To gain access to them in your program, it is necessary to #include <arm_neon.h>. The verbose documentation of all available intrinsics is available at http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf , but you may find more user-friendly tutorials elsewhere online.
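As a quick taste of what they look like (a sketch only; the function and array names are illustrative, and it assumes the length is a multiple of 4), here is a horizontal sum written with the intrinsics:

#include <arm_neon.h>

// Sum a float array using four parallel partial sums, then collapse them.
float sum_array(const float* data, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);            // four running partial sums
    for (int i = 0; i < n; i += 4)
        acc = vaddq_f32(acc, vld1q_f32(data + i));  // accumulate 4 lanes at a time
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);                   // pairwise add -> a single value
    return vget_lane_f32(half, 0);
}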

Advice on this site is generally against the NEON intrinsics, and certainly there are GCC versions which have done a poor job of implementing them, but recent versions do reasonably well (and if you spot bad code generation, please do raise it as a bug - https://gcc.gnu.org/bugzilla/ )

They are an easy way to program to the NEON/Advanced SIMD instruction set, and the performance you can achieve is often rather good. They are also "portable", in that when you move to an AArch64 system, a superset of the intrinsics you can use from ARMv7-A is available. They are also portable across implementations of the ARM architecture, which can vary in their performance characteristics, but which the compiler will model for performance tuning.

The principal benefit of the NEON intrinsics over hand-written assembly is that the compiler can understand them when performing its various optimization passes. By contrast, hand-written assembler is an opaque block to GCC and will not be optimized. On the other hand, expert assembler programmers can often beat the compiler's register allocation policies, particularly when using the instructions which write to or read from multiple consecutive registers.

5
votes

In addition to Wally's answer - and this probably should be a comment, but I couldn't make it short enough: ARM has a team of compiler developers whose entire role is to improve the parts of GCC and Clang/LLVM that do code generation for ARM CPUs, including the features that provide "auto-vectorization". I have not looked deeply into it, but from my experience of x86 code generation, I'd expect that for anything that is relatively easy to vectorize, the compiler should do a decent job. Some code is hard for the compiler to understand when it can vectorize or not, and may need some "encouragement" - such as unrolling loops or marking conditions as "likely" or "unlikely", etc.
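To give a feel for the easy case (the function below is purely illustrative): a plain loop over contiguous arrays, with __restrict telling the compiler the pointers don't alias, is exactly the shape the auto-vectorizer likes.

// Illustrative only: with -O3 (or -ftree-vectorize) and a NEON-enabled target,
// GCC can typically turn this into vector loads, multiply-adds and stores,
// because the pointers are declared non-aliasing and the data is contiguous.
void scale_add(float* __restrict dst, const float* __restrict src,
               float scale, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] += src[i] * scale;
}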

Disclaimer: I work for ARM, but have very little to do with the compilers or even CPUs, as I work for the group that does graphics (where I have some involvement with compilers for the GPUs in the OpenCL part of the GPU driver).

Edit:

Performance, and the use of various instruction extensions, really depends on EXACTLY what the code is doing. I'd expect that libraries such as OpenCV are already doing a fair amount of clever stuff in their code (such as hand-written assembler, compiler intrinsics, and generally code that is designed to let the compiler do a good job), so it may not really give you much improvement. I'm not a computer vision expert, so I can't really comment on exactly how much such work has been done in OpenCV, but I'd certainly expect the "hottest" points of the code to have been fairly well optimised already.

Also, profile your application. Don't just fiddle with optimisation flags; measure its performance and use a profiling tool (e.g. the Linux "perf" tool) to find WHERE your code is spending its time. Then see what can be done about that particular code. Is it possible to write a more parallel version of it? Can the compiler help, or do you need to write assembler? Is there a different algorithm that does the same thing but in a better way, etc., etc.?
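If a proper profiler isn't available on the board, even crude wall-clock timing around a suspect region tells you something. A sketch using std::chrono (C++11; the function names are illustrative):

#include <chrono>
#include <iostream>

void run_suspect_code();  // illustrative placeholder for the region being measured

void time_it()
{
    auto start = std::chrono::steady_clock::now();
    run_suspect_code();
    auto stop = std::chrono::steady_clock::now();
    std::cout << "took "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us\n";
}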

Although tweaking compiler options CAN help, and often does, it tends to give tens of percent, whereas a change in algorithm can often lead to 10 or 100 times faster code - assuming, of course, that your algorithm can be improved!

Understanding what part of your application is taking the time, however, is KEY. There's no point in making the code that takes 5% of the time 10% faster, when a change somewhere else could make a piece of code that is 30 or 60% of the total time 20% faster. Nor in optimising some math routine when 80% of the time is spent reading a file, where making the buffer twice the size would make the whole thing twice as fast...

5
votes

Although a long time has passed since I submitted this question, I see that it still gathers some interest, so I decided to share what I ended up doing about it.

My main goal was to optimize a for loop which was the bottleneck of the project. So, since I don't know anything about assembly, I decided to give NEON intrinsics a go. I ended up with a 40-50% gain in performance (in this loop alone), and a significant improvement in the overall performance of the whole project.

The code does some math to transform a bunch of raw distance data into distance to a plane in millimetres. I use some constants (like _constant05 and _fXtoZ) that are not defined here; they are just constant float32x4_t values defined elsewhere. As you can see, I'm doing the math for 4 elements at a time - talk about real parallelization :)

// Requires #include <arm_neon.h> for the NEON types and intrinsics.
unsigned short* frameData = frame.ptr<unsigned short>(_depthLimits.y, _depthLimits.x);

unsigned short step = _runWidth - _actWidth; // because a ROI is being processed, not the whole image

cv::Mat distToPlaneMat = cv::Mat::zeros(_runHeight, _runWidth, CV_32F);

float* fltPtr = distToPlaneMat.ptr<float>(_depthLimits.y, _depthLimits.x); // a pointer to the start of the output data

for(unsigned short y = _depthLimits.y; y < _depthLimits.y + _depthLimits.height; y++)
{
    for (unsigned short x = _depthLimits.x; x < _depthLimits.x + _depthLimits.width - 1; x += 4)
    {
        // Pixel coordinates of the 4 lanes processed in this iteration
        float32x4_t projX = {(float)x, (float)(x + 1), (float)(x + 2), (float)(x + 3)};
        float32x4_t projY = {(float)y, (float)y, (float)y, (float)y};

        // Load 4 raw 16-bit depth values and widen them to float, lane by lane
        uint16x4_t framePixels = vld1_u16(frameData);
        float32x4_t floatFramePixels = {(float)framePixels[0], (float)framePixels[1], (float)framePixels[2], (float)framePixels[3]};

        // fNormalizedY = _constant05 - projY * _yResInv (multiply-subtract)
        float32x4_t fNormalizedY = vmlsq_f32(_constant05, projY, _yResInv);

        // fNormalizedX = projX * _xResInv - _constant05
        float32x4_t auxfNormalizedX = vmulq_f32(projX, _xResInv);
        float32x4_t fNormalizedX = vsubq_f32(auxfNormalizedX, _constant05);

        // Back-project to real-world coordinates
        float32x4_t realWorldX = vmulq_f32(fNormalizedX, floatFramePixels);
        realWorldX = vmulq_f32(realWorldX, _fXtoZ);

        float32x4_t realWorldY = vmulq_f32(fNormalizedY, floatFramePixels);
        realWorldY = vmulq_f32(realWorldY, _fYtoZ);

        float32x4_t realWorldZ = floatFramePixels;

        // Subtract the reference vector (_tlVec*)
        realWorldX = vsubq_f32(realWorldX, _tlVecX);
        realWorldY = vsubq_f32(realWorldY, _tlVecY);
        realWorldZ = vsubq_f32(realWorldZ, _tlVecZ);

        // Per-lane dot product with (_xPlane, _yPlane, _zPlane) gives the distance to the plane
        float32x4_t distAuxX, distAuxY, distAuxZ;

        distAuxX = vmulq_f32(realWorldX, _xPlane);
        distAuxY = vmulq_f32(realWorldY, _yPlane);
        distAuxZ = vmulq_f32(realWorldZ, _zPlane);

        float32x4_t distToPlane = vaddq_f32(distAuxX, distAuxY);
        distToPlane = vaddq_f32(distToPlane, distAuxZ);

        // Write the 4 results back out
        *fltPtr = (float) distToPlane[0];
        *(fltPtr + 1) = (float) distToPlane[1];
        *(fltPtr + 2) = (float) distToPlane[2];
        *(fltPtr + 3) = (float) distToPlane[3];

        frameData += 4;
        fltPtr += 4;
    }
    frameData += step;
    fltPtr += step;
}
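(A possible refinement, not measured here: the four scalar stores at the end could likely be replaced by a single vst1q_f32(fltPtr, distToPlane), and the lane-by-lane conversion of framePixels by vcvtq_f32_u32(vmovl_u16(framePixels)), but the version above is what ended up in the project.)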
4
votes

If you don't want to mess with assembly code at all, then tweak the compiler flags to maximally optimize for speed. Given the proper ARM target, gcc should do this, provided the number of loop iterations is apparent.
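As a sketch of what "the number of loop iterations is apparent" means (illustrative code only), a compile-time-visible trip count over a plain contiguous array is the easy case for the vectorizer, while a loop bound that depends on runtime state is much harder for it to reason about:

// Illustrative only: a fixed, compile-time-known trip count over contiguous
// floats - the kind of loop the auto-vectorizer handles well.
float samples[256];

void halve_samples()
{
    for (int i = 0; i < 256; ++i)
        samples[i] *= 0.5f;
}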

To check gcc code generation, request assembly output by adding the -S flag.

If after several tries (of reading the gcc documentation and tweaking flags) you still can't get it to produce the code you want, then take the assembly output and edit it to your satisfaction.


Beware of premature optimization. The proper development order is to get the code functional, then see if it needs optimization. Only when the code is stable does it make sense to do so.

0
votes

Play with some minimal assembly examples on QEMU to understand the instructions

The following setup does not have many examples yet, but it serves as a neat playground:

The examples run in QEMU user mode, which dispenses with extra hardware, and GDB works just fine.

The asserts are done through the C standard library.

You should be able to easily extend that setup with new instructions as you learn them.

ARM intrinsics in particular were asked about at: Is there a good reference for ARM Neon intrinsics?