I'm benchmarking a library on ARMv8 machines. I have four Cortex-A53 dev boards, and our NEON intrinsics implementation outperforms the C/C++ implementation by about 30%. This is expected.
The GCC compile farm offers a SoftIron Overdrive 1000. It's a Cortex-A57 server board, and there the C/C++ code outperforms the intrinsics implementation by about 50%. This was surprising.
We'd like to use our NEON implementation on the A53 but the C/C++ implementation on the A57. We have code that makes runtime feature selections, like HasNEON(), HasCRC(), HasAES() and HasSHA(). We don't have anything for the microarchitecture, like A53 vs. A57.
My question is, how do we detect an A53 vs A57 at runtime?
We have similar code for the x86 code paths for the P4 processor, which has some slow word operations. We detect the P4 by checking CPUID bits, but ARM systems are different. On ARM, the CPUID-like operation is a system register read (MRS of MIDR_EL1 on ARMv8), and it usually requires a higher privilege level (EL1 or above).
If you're interested, the Cortex-A57 is slower for this particular hash algorithm because the algorithm relies heavily on shifts, rotates and XORs. The Cortex-A57 Software Optimization Guide tells us shifts and rotates are more expensive there: a shift takes 4 or 5 cycles in the ASIMD coprocessor, and only the F1 pipe can perform the operation (per section 3.14).
It could also be that the Cortex-A53 has the same penalty, but its integer unit is slow enough that the non-NEON code does not outperform the NEON code.