Coding for ARM NEON: How to start?

Question

BACKGROUND (skip this if you like)

Let me start by saying that I am no expert programmer. I am a young junior computer vision (CV) engineer, and I am fairly experienced in C++ programming mainly because of an extensive use of the great OpenCV2 C++ API. All I've learned was through the need to execute projects, the need to solve problems and meet deadlines, as it is the reality in the industry.

Recently, we started developing CV software for embedded systems (ARM boards), and we do it using plain C++ optimized code. However, it is a huge challenge to build a real-time CV system in this kind of architecture due to its limited resources when compared to traditional computers.

Thats when I found about NEON. I've read a bunch of articles about this, but this is a fairly recent theme, so there isn't much information about it and the more I read, the more confused I get.

The QUESTION

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used in C++ environment? The main source of my confusion is the fact that almost all code snipets I see are in Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point. I use Eclipse IDE in Linux Gentoo to write C++ code.

UPDATE

After reading the answers I did some tests with the software. I compiled my project with the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon

Keep in mind that this project include extensive libraries such as openframeworks, OpenCV and OpenNI, and everything was compiled with these flags. To compile for the ARM board we use a Linaro toolchain crosscompiler, and GCC's version is 4.8.3. Would you expect this to improve the performance of the project? Because we experienced no changes at all, which is rather weird considering all the answers I read here.

Another question: all the for cycles have an apparent number of iteratons, but many of them iterate through custom data types (structs or classes). Can GCC optimize these cycles even though they iterate through custom data types?

Thanks for the answers everyone. Although this question was (understandably) down voted, because it isn't really a question, these answers offer a great insight to those who are starting to dig into this subject :) — Pedro Batista
If you were compiling with -O3, then -ftree-vectorize would already be implied, so I wouldn't expect a change in performance. Additionally, if somebody has already written the code around the bottlenecks near-optimally, I wouldn't expect to see any improvement. The only way to truly tell is to look at some performance data for the code, find the hotspots, look at how they are currently implemented and what they currently compile to, and then take a decision as to whether you can do better. — James Greenhalgh

Rob Napier Rob Napier · Accepted Answer · 2015-02-16T21:07:36

EDIT:

From your update, you may misunderstand what the NEON processor does. It is an SIMD (Single Instruction, Multiple Data) vector processor. That means that it is very good at performing an instruction (say "multiply by 4") to several pieces of data at the same time. It also loves to do things like "add all these numbers together" or "add each element of these two lists of numbers to create a third list of numbers." So if you problem looks like those things the NEON processor is going to be huge help.

To get that benefit, you must put your data in very specific formats so that the vector processor can load multiple data simultaneously, process it in parallel, and then write it back out simultaneously. You need to organize things such that the math avoids most conditionals (because looking at the results too soon means a roundtrip to the NEON). Vector programming is a different way of thinking about your program. It's all about pipeline management.

Now, for many very common kinds of problems, the compiler automatically can work all of this out. But it's still about working with numbers, and numbers in particular formats. For example, you almost always need to get all of your numbers into a contiguous block in memory. If you're dealing with fields inside of structs and classes, the NEON can't really help you. It's not a general-purpose "do stuff in parallel" engine. It's an SIMD processor for doing parallel math.

For very high-performance systems, data format is everything. You don't take arbitrary data formats (structs, classes, etc.) and try to make them fast. You figure out the data format that will let you do the most parallel work, and you write your code around that. You make your data contiguous. You avoid memory allocation at all costs. But this isn't really something a simple StackOverflow question can address. High-performance programming is a whole skill set and a different way of thinking about things. It isn't something you get by finding the right compiler flag. As you've found, the defaults are pretty good already.

The real question you should be asking is whether you could reorganize your data so that you can use more of OpenCV. OpenCV already has lots of optimized parallel operations that will almost certainly make good use of the NEON. As much as possible, you want to keep your data in the format that OpenCV works in. That's likely where you're going to get your biggest improvements.

My experience is that it is certainly possible to hand-write NEON assembly that will beat clang and gcc (at least from a couple of years ago, though the compiler certainly continues to improve). Having excellent ARM optimization is not the same as NEON optimization. As @Mats notes, the compiler will generally do an excellent job at obvious cases, but does not always handle every case ideally, and it is certainly possible for even a lightly skilled developer to sometimes beat it, sometimes dramatically. (@wallyk is also correct that hand-tuning assembly is best saved for last; but it can still be very powerful.)

That said, given your statement "Assembly, for which I have absolutely no background, and can't possibly afford to learn at this point," then no, you should not even bother. Without first at least understanding the basics (and a few non-basics) of assembly (and specifically vectorized NEON assembly), there is no point in second-guessing the compiler. Step one of beating the compiler is knowing the target.

If you are willing to learn the target, my favorite introduction is Whirlwind Tour of ARM Assembly. That, plus some other references (below), were enough to let me beat the compiler by 2-3x in my particular problems. On the other hand, they were insufficient enough that when I showed my code to an experienced NEON developer, he looked at it for about three seconds and said "you have a halt right there." Really good assembly is hard, but half-decent assembly can still be better than optimized C++. (Again, every year this gets less true as the compiler writers get better, but it can still be true.)

ARM Assembly language
A few things iOS developers ought to know about the ARM architecture (iPhone-focused, but the principles are the same for all uses.)
ARM NEON support in the ARM compiler
Coding for NEON

One side note, my experience with NEON intrinsics is that they are seldom worth the trouble. If you're going to beat the compiler, you're going to need to actually write full assembly. Most of the time, whatever intrinsic you would have used, the compiler already knew about. Where you get your power is more often in restructuring your loops to best manage your pipeline (and intrinsics don't help there). It's possible this has improved over the last couple of years, but I would expect the improving vector optimizer to outpace the value of intrinsics more than the other way around.

Coding for ARM NEON: How to start?

7 Answers