I'm benchmarking a library on ARMv8 machines. I have four Cortex-A53 dev boards, and our NEON intrinsics implementation outperforms the C/C++ implementation by about 30%. This is expected.
The GCC compile farm offers a SoftIron Overdrive 1000. It's a Cortex-A57 server board, and there the C/C++ code outperforms the intrinsics implementation by about 50%. This was surprising.
We'd like to use our NEON implementation on the A53 but the C/C++ implementation on the A57. We have code that makes runtime feature selections, like HasNEON(), HasCRC(), HasAES() and HasSHA(). We don't have anything for the microarchitecture, like A53 vs. A57.
My question is, how do we detect an A53 vs A57 at runtime?
We have similar code for the x86 code paths for the P4 processor, which has some slow word operations. We detect the P4 by checking CPUID bits, but ARM systems are different. On ARM, the CPUID-like operation is a system register read (MRS of MIDR_EL1 on ARMv8), and it usually requires a higher privilege level (EL1 or above).
If you're interested, the Cortex-A57 is slower for this particular hash algorithm because the algorithm relies heavily on shifts, rotates and XORs. The Cortex-A57 Software Optimization Guide tells us shifts and rotates are more expensive there: a shift takes 4 or 5 cycles in the ASIMD coprocessor, and only the F1 pipe can perform the operation (per section 3.14).
It could also be that the Cortex-A53 has the same penalty, but its integer unit is slow enough that the non-NEON code does not outperform the NEON code.