I'm writing C++ code to run on an ARM Cortex-A9 CPU. My code links to a closed-source third-party library that is compiled with soft-float.
I noticed that if I compile my code with the gcc flag -mfloat-abi=softfp, it runs much faster than when I compile it with -mfloat-abi=hard.
I thought hard-float should always be faster. Does this make sense?
How can I optimize these library calls?
Thanks!
Some notes:
- The library interface is built of integers, strings and pointers only, and it works fine.
- The softfp build is about 8x faster than the hard-float build.
- readelf platform related information regarding the binaries:
3rd party library:
readelf -hA libXXX.so
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: ARM
Version: 0x1
Entry point address: 0x13780
Start of program headers: 52 (bytes into file)
Start of section headers: 1617724 (bytes into file)
Flags: 0x4000002, has entry point, Version4 EABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 7
Size of section headers: 40 (bytes)
Number of section headers: 28
Section header string table index: 27
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "ARM9TDMI"
Tag_CPU_arch: v4T
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-1
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align8_needed: Yes
Tag_ABI_align8_preserved: Yes, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_optimization_goals: Aggressive Speed
my binary:
readelf -hA XXX
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: ARM
Version: 0x1
Entry point address: 0x1b0d4
Start of program headers: 52 (bytes into file)
Start of section headers: 1392964 (bytes into file)
Flags: 0x5000002, has entry point, Version5 EABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 8
Size of section headers: 40 (bytes)
Number of section headers: 38
Section header string table index: 35
Attribute Section: aeabi
File Attributes
Tag_CPU_name: "Cortex-A9"
Tag_CPU_arch: v7
Tag_CPU_arch_profile: Application
Tag_ARM_ISA_use: Yes
Tag_THUMB_ISA_use: Thumb-2
Tag_VFP_arch: VFPv3
Tag_NEON_arch: NEONv1
Tag_ABI_PCS_wchar_t: 4
Tag_ABI_FP_denormal: Needed
Tag_ABI_FP_exceptions: Needed
Tag_ABI_FP_number_model: IEEE 754
Tag_ABI_align8_needed: Yes
Tag_ABI_align8_preserved: Yes, except leaf SP
Tag_ABI_enum_size: int
Tag_ABI_HardFP_use: SP and DP
Tag_ABI_VFP_args: VFP registers
Tag_unknown_34: 1 (0x1)
Tag_unknown_42: 1 (0x1)
Tag_unknown_44: 1 (0x1)
Tag_unknown_68: 1 (0x1)
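To cross-check the readelf attributes above (Tag_ABI_VFP_args: VFP registers indicates the hard-float calling convention), the float ABI can also be reported from inside the code. This is only a minimal sketch, assuming GCC targeting 32-bit ARM EABI, where `__ARM_PCS_VFP` is predefined for -mfloat-abi=hard and `__SOFTFP__` for -mfloat-abi=soft. The names `float_abi_name` and `print_float_abi` are hypothetical helpers, not part of any library.

```cpp
#include <cstdio>

// Returns a label for the float ABI this translation unit was compiled for.
// Assumption: GCC-style predefined macros on a 32-bit ARM EABI target.
constexpr const char* float_abi_name() {
#if defined(__ARM_PCS_VFP)
    // -mfloat-abi=hard: FP arguments are passed in VFP registers.
    return "hard";
#elif defined(__SOFTFP__)
    // -mfloat-abi=soft: no FP instructions, library emulation only.
    return "soft";
#elif defined(__arm__)
    // Neither macro set on ARM: assumed to be -mfloat-abi=softfp
    // (VFP instructions, FP arguments passed in core registers).
    return "softfp";
#else
    return "not an ARM target";
#endif
}

// Hypothetical helper: print the detected ABI once at startup.
inline void print_float_abi() {
    std::printf("float ABI: %s\n", float_abi_name());
}
```

Calling `print_float_abi()` at the start of `main` in both builds would confirm that each binary really uses the ABI its flags suggest.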
gcc will automatically generate veneers between the two. So you may be having undefined behavior, which can result in slow performance. I.e., some bad (non-conforming) code can run slower with -O3 versus -O0 because the code ends up doing something undefined and it compiles to something completely different with different optimizations. – artless noise