Linking hard float to softfp bad performance

Question

I'm writing C++ code to run on an ARM Cortex-A9 CPU. My code links against a closed-source third-party library that is compiled with soft-float.

I noticed that if I compile my code with the GCC flag -mfloat-abi=softfp, it runs much faster than when I compile it with -mfloat-abi=hard.

I thought hard-float should always be faster. Does it make sense?

How can I optimize these library calls?

Thanks!

Some notes:

The library interface consists of integers, strings and pointers only, and it works fine.
The softfp build is about 8x faster than the hard build.
Platform-related readelf output for the two binaries:

3rd party library:

readelf -hA libXXX.so
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x13780
  Start of program headers:          52 (bytes into file)
  Start of section headers:          1617724 (bytes into file)
  Flags:                             0x4000002, has entry point, Version4 EABI
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         28
  Section header string table index: 27
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "ARM9TDMI"
  Tag_CPU_arch: v4T
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-1
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align8_needed: Yes
  Tag_ABI_align8_preserved: Yes, except leaf SP
  Tag_ABI_enum_size: int
  Tag_ABI_optimization_goals: Aggressive Speed

my binary:

readelf -hA XXX
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x1b0d4
  Start of program headers:          52 (bytes into file)
  Start of section headers:          1392964 (bytes into file)
  Flags:                             0x5000002, has entry point, Version5 EABI
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         8
  Size of section headers:           40 (bytes)
  Number of section headers:         38
  Section header string table index: 35
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "Cortex-A9"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_VFP_arch: VFPv3
  Tag_NEON_arch: NEONv1
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align8_needed: Yes
  Tag_ABI_align8_preserved: Yes, except leaf SP
  Tag_ABI_enum_size: int
  Tag_ABI_HardFP_use: SP and DP
  Tag_ABI_VFP_args: VFP registers
  Tag_unknown_34: 1 (0x1)
  Tag_unknown_42: 1 (0x1)
  Tag_unknown_44: 1 (0x1)
  Tag_unknown_68: 1 (0x1)

Hard-float is faster if you use it properly. I don't think that gcc will automatically generate veneers between the two, so you may have undefined behavior, which can result in slow performance. For example, some non-conforming code can run slower with -O3 than with -O0, because it ends up doing something undefined and compiles to something completely different at different optimization levels. — artless noise
So you are doing something you shouldn't (afaik): linking hard-float with softfp. The register passing convention is different. I believe it is possible to write some sort of assembler library that would convert between the two; you would shim the whole 3rd party API and convert from hard-float register passing to stack-based passing. — artless noise
In my experience this situation ends abruptly with the linker screaming "Error: X uses VFP register arguments, Y does not". Clearly some clever trickery is happening here... — Notlikethat
Thanks guys, I was hoping there might be something else out there. I guess I'll have to use two separate processes, one hardfloat and one softfloat that links to the 3rd party, communicating with IPC and shared mem... That's the easiest solution, isn't it? — oferlivny

ams · Accepted Answer · 2014-08-01T15:59:29

The two ABIs selected by -mfloat-abi=softfp and -mfloat-abi=hard are not compatible. You cannot mix and match.

Typically, you can't even use softfp processes on a hardfp system unless you have all the libraries duplicated in different lib directories (i.e. "multiarch").

If your code happens not to use float or double types in function parameters, you might find that it does work in practice, but you still should not do it: you are playing with fire.
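
To make the point about function parameters concrete: under -mfloat-abi=hard, float and double arguments and return values travel in VFP registers, while under soft and softfp they travel in core registers or on the stack. Integers and pointers are passed the same way in both cases. The sketch below is a rough compile-time check for which function signatures are affected. The names any_fp and passes_fp_by_value are made up for this illustration, and the check is deliberately simplified: it only looks at plain float/double/long double passed or returned by value, and ignores structs that contain floating-point members.

```cpp
#include <type_traits>

// Hypothetical helper: true if any type in the pack is a floating-point
// type passed by value (pointers and references to floats do not count).
template <typename... Ts>
struct any_fp : std::false_type {};

template <typename T, typename... Rest>
struct any_fp<T, Rest...>
    : std::integral_constant<bool,
          std::is_floating_point<T>::value || any_fp<Rest...>::value> {};

// Hypothetical helper: true if a function signature passes or returns a
// floating-point value, i.e. the calling convention differs between
// -mfloat-abi=hard and -mfloat-abi=softfp for that function.
// Simplification: structs containing floats are NOT detected.
template <typename Sig>
struct passes_fp_by_value;

template <typename R, typename... Args>
struct passes_fp_by_value<R(Args...)> : any_fp<R, Args...> {};

// An interface built of integers, strings and pointers is passed the
// same way under both ABIs.
static_assert(!passes_fp_by_value<int(const char*, int, void*)>::value,
              "integer/pointer-only signature should not be flagged");

// A single float parameter is enough to make the two ABIs disagree.
static_assert(passes_fp_by_value<int(int, float)>::value,
              "float parameter should be flagged");
```

A signature that is not flagged here is only "accidentally compatible"; it does not make mixing the two ABIs supported.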

In any case, if your code is entirely integer-based then these options will make no difference to the generated code, so the performance change must be coming from somewhere else. Perhaps the compiler you are using unexpectedly selects a different multilib, or a different CPU, when you specify the -mfloat-abi option (GCC has a habit of switching back to the default multilib, in particular). Maybe you switched on NEON by mistake, or switched from A8 tuning to A9?
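
One way to start narrowing this down is to have each build report what it was actually compiled for, then compare the softfp and hard builds. The sketch below relies on predefined macros that GCC and Clang provide for 32-bit ARM targets (__arm__, __ARM_PCS_VFP, __SOFTFP__, __ARM_NEON / __ARM_NEON__). The enum and function names are made up for this example. It does not show which -mtune value or multilib was picked; for that, comparing the output of gcc -v for the two builds is probably more useful.

```cpp
#include <cstdio>

// Hypothetical enum naming the float ABI this translation unit targets.
enum class FloatAbi { NotArm32, Soft, SoftFp, Hard };

// Classify the build using compiler-predefined macros.
constexpr FloatAbi compiled_float_abi() {
#if !defined(__arm__)
    return FloatAbi::NotArm32;   // not a 32-bit ARM target at all
#elif defined(__ARM_PCS_VFP)
    return FloatAbi::Hard;       // FP arguments passed in VFP registers
#elif defined(__SOFTFP__)
    return FloatAbi::Soft;       // no FP instructions, library emulation
#else
    return FloatAbi::SoftFp;     // FP instructions, soft-float calling convention
#endif
}

constexpr const char* float_abi_name(FloatAbi abi) {
    return abi == FloatAbi::Hard   ? "hard"
         : abi == FloatAbi::SoftFp ? "softfp"
         : abi == FloatAbi::Soft   ? "soft"
                                   : "not 32-bit ARM";
}

// True if the compiler was told it may generate NEON instructions.
constexpr bool neon_enabled() {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return true;
#else
    return false;
#endif
}

// Print one line describing the build; call it once at program start.
inline void print_build_config() {
    std::printf("float ABI: %s, NEON: %s\n",
                float_abi_name(compiled_float_abi()),
                neon_enabled() ? "on" : "off");
}
```

If the two builds differ in more than the float ABI line, the extra difference is a likely candidate for the performance change.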
