Erlang: blocking C NIF call behavior

Question

I have observed a blocking behavior of C NIFs when they were being called concurrently by many Erlang processes. Can it be made non-blocking? Is there a mutex at work here which I'm not able to comprehend?

P.S. A basic "Hello world" NIF can be tested by making it sleep for a hundred microseconds in case of a particular PID calling it. It can be observed that the other PIDs calling the NIF wait for that sleep to execute before their execution.

Non blocking behavior would be beneficial in cases where concurrency might not pose an issue(e.g. array push, counter increment).

I am sharing the links to 4 gists which comprise of a spawner, conc_nif_caller and niftest module respectively. I have tried to tinker with the value of Val and I have indeed observed a non-blocking behavior. This is confirmed by assigning a large integer parameter to the spawn_multiple_nif_callers function.

Links spawner.erl,conc_nif_caller.erl,niftest.erl and finally niftest.c.

The line below is printed by the Erlang REPL on my Mac.

Erlang/OTP 17 [erts-6.0] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

mpm mpm · Accepted Answer · 2014-10-23T18:09:35

NIF's themselves don't have any mutex. You could implement one in C, and there is one when you load NIF's object, but this should be done only once with loading module.

One thing that's might be happening (and I would bet that's what is going on), is you C code messes up Erlang scheduler(s).

A native function that do lengthy work before returning will degrade responsiveness of the VM, and may cause miscellaneous strange behaviors. Such strange behaviors include, but are not limited to, extreme memory usage, and bad load balancing between schedulers. Strange behaviors that might occur due to lengthy work may also vary between OTP releases.

and description of what lengty work means and how you could solve it.

In very few words (with quite few simplifications):

For core one scheduler is created. Each has a list of processes which he can run. If ones scheduler list is empty, he will try to still work from another one. This can fail, if there is nothing (or not enough) to still.

Erlang schedulers spends some amount of work in one process, than moves to another, spend there some amount of work, and move to another. And so on, and so one. This is very similar to scheduling in system processes.

One thing that very important here is calculating amount of work. As default each function call has assigned some number of reductions. Addition could have two, calling function in your module will have one, sending a message also a one, some build-in could have more (like list_to_binary). If we collect 2 000 reductions we move to another process.

So what is the cost of your C function? It's only one reduction.

Code like

loop() ->
   call_nif_function(),
   loop().

could be taking all whole hour, but scheduler will be stuck in this one process, because he still haven't count to 2 000 reductions. Or to put it in other words, he could be stuck inside NIF without possibility to move forward (at least any time soon).

There are few ways around this but general rule is stat NIF's should not take long time. So if you have long running C code, maybe you should use drivers instead. They should be much easier to implement and manage, that tinkering with NIF's.

Erlang: blocking C NIF call behavior

3 Answers