Though you cannot "reopen" instances in Haskell the way you can with classes in dynamic languages, you can tell GHC to inline and specialise functions as aggressively as possible by passing certain flags.
-fspecialise-aggressively removes the restrictions about which functions are specialisable. Any overloaded function will be specialised with this flag. This can potentially create lots of additional code.

-fexpose-all-unfoldings will include the (optimised) unfoldings of all functions in interface files so that they can be inlined and specialised across modules.

Using these two flags in conjunction will have nearly the same effect as marking every definition as INLINABLE, apart from the fact that the unfoldings for INLINABLE definitions are not optimised.
(Source: https://wiki.haskell.org/Inlining_and_Specialisation#Which_flags_can_I_use_to_control_the_simplifier_and_inliner.3F)
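To make the comparison concrete, here is a minimal sketch (the module and function names are hypothetical, not from the original) of the per-definition pragmas that these two flags approximate globally:

module Lib (plusOne) where

-- An overloaded function. Without a pragma its unfolding may not be exported,
-- so other modules cannot inline or specialise it.
plusOne :: Num a => a -> a
plusOne x = x + 1

-- Expose the unfolding in the interface file, roughly what
-- -fexpose-all-unfoldings does for every definition (modulo the
-- optimised/unoptimised difference noted in the quote above).
{-# INLINABLE plusOne #-}

-- Ask GHC for a monomorphic copy at Int, roughly the kind of specialisation
-- that -fspecialise-aggressively makes GHC willing to do for any overloaded
-- function it can see.
{-# SPECIALISE plusOne :: Int -> Int #-}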
These options will allow the GHC compiler to inline fmap. The -fexpose-all-unfoldings option, in particular, allows the compiler to expose the internals of Data.Functor to the rest of the program for inlining purposes (and it seems to provide the largest performance benefit). Here's a quick & dumb benchmark I threw together:
functor.hs contains this code:
{-# LANGUAGE DeriveFunctor #-}
data Foo a = MakeFoo a a deriving (Functor)
one_fmap foo = fmap (+1) foo
main = sequence (fmap (\n -> return $ one_fmap $ MakeFoo n n) [1..10000000])
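For reference (this is not part of functor.hs), the derived instance is roughly what you would write by hand, and its fmap is the one the flags let GHC inline at the call site in one_fmap:

-- Sketch of what `deriving (Functor)` generates for Foo; shown only to make
-- clear which fmap is being inlined. Do not add it to the file as well.
instance Functor Foo where
  fmap f (MakeFoo x y) = MakeFoo (f x) (f y)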
Compiled with no arguments:
$ time ./functor
real 0m4.036s
user 0m3.550s
sys 0m0.485s
Compiled with -fexpose-all-unfoldings:
$ time ./functor
real 0m3.662s
user 0m3.258s
sys 0m0.404s
Here's the .prof file from this compile, to show that the call to fmap is indeed getting inlined:
	Sun Oct 7 00:06 2018 Time and Allocation Profiling Report (Final)

	   functor +RTS -p -RTS

	total time  =        1.95 secs   (1952 ticks @ 1000 us, 1 processor)
	total alloc = 4,240,039,224 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  SRC              %time  %alloc

CAF          Main    <entire-module>  100.0   100.0

                                                                individual      inherited
COST CENTRE  MODULE                 SRC              no.  entries  %time %alloc   %time %alloc

MAIN         MAIN                   <built-in>        44        0    0.0    0.0   100.0  100.0
 CAF         Main                   <entire-module>   87        0  100.0  100.0   100.0  100.0
 CAF         GHC.IO.Handle.FD       <entire-module>   84        0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Encoding        <entire-module>   77        0    0.0    0.0     0.0    0.0
 CAF         GHC.Conc.Signal        <entire-module>   71        0    0.0    0.0     0.0    0.0
 CAF         GHC.IO.Encoding.Iconv  <entire-module>   58        0    0.0    0.0     0.0    0.0
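The original answer doesn't show the exact commands used, but the report header above indicates the program was run with +RTS -p, which requires compiling with profiling support. Something along these lines should produce a similar report (the exact flag set is my assumption):

$ ghc -prof -fexpose-all-unfoldings functor.hs -o functor
$ ./functor +RTS -p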
Compiled with -fspecialise-aggressively:
$ time ./functor
real 0m3.761s
user 0m3.300s
sys 0m0.460s
Compiled with both flags:
$ time ./functor
real 0m3.665s
user 0m3.213s
sys 0m0.452s
These little benchmarks are by no means representative of what the performance (or file size) will look like in real code, but they definitely show that you can force the GHC compiler to inline fmap, and that doing so can have non-negligible effects on performance.