Technically, why are processes in Erlang more efficient than OS threads?

Question

Erlang's Characteristics

From Erlang Programming (2009):

Erlang concurrency is fast and scalable. Its processes are lightweight in that the Erlang virtual machine does not create an OS thread for every created process. They are created, scheduled, and handled in the VM, independent of underlying operating system. As a result, process creation time is of the order of microseconds and independent of the number of concurrently existing processes. Compare this with Java and C#, where for every process an underlying OS thread is created: you will get some very competitive comparisons, with Erlang greatly outperforming both languages.

From Concurrency oriented programming in Erlang (pdf) (slides) (2003):

We observe that the time taken to create an Erlang process is constant 1µs up to 2,500 processes; thereafter it increases to about 3µs for up to 30,000 processes. The performance of Java and C# is shown at the top of the figure. For a small number of processes it takes about 300µs to create a process. Creating more than two thousand processes is impossible.

We see that for up to 30,000 processes the time to send a message between two Erlang processes is about 0.8µs. For C# it takes about 50µs per message, up to the maximum number of processes (which was about 1800 processes). Java was even worse, for up to 100 process it took about 50µs per message thereafter it increased rapidly to 10ms per message when there were about 1000 Java processes.

My thoughts

I don't fully understand, technically, why Erlang processes are so much cheaper to spawn and have such a small memory footprint per process. Both the OS and the Erlang VM have to do scheduling and context switches, keep track of the values in the registers, and so on...

Put simply, why aren't OS threads implemented in the same way as Erlang processes? Do they have to support something more? Why do they need a bigger memory footprint? And why are spawning and communication slower for them?

Technically, why are processes in Erlang more efficient than OS threads when it comes to spawning and communication? And why can't threads in the OS be implemented and managed in the same efficient way? And why do OS threads have a bigger memory footprint, plus slower spawning and communication?

More reading

Before attempting to understand the reason why a hypothesis is true, you need to establish whether the hypothesis is true -- e.g., supported by the evidence. Do you have references for any like-for-like comparisons demonstrating that an Erlang process actually is more efficient than (say) a Java thread on an up-to-date JVM? Or a C app using OS process and thread support directly? (The latter seems very, very unlikely to me. The former only somewhat likely.) I mean, with a limited enough environment (Francisco's point), it might be true, but I'd want to see the numbers. — T.J. Crowder
@Donal: As is the case with so many other absolute statements. :-) — T.J. Crowder
@Jonas: Thanks, but I got as far as the date (1998-11-02) and JVM version (1.1.6) and stopped. Sun's JVM has improved a fair bit in the last 11.5 years (and presumably Erlang's interpreter has as well), particularly in the area of threading. (Just to be clear, I'm not saying that the hypothesis isn't true [and Francisco and Donal have pointed out why Erlang may be able to do something there]; I'm saying it shouldn't be taken at face value without being checked.) — T.J. Crowder
@Jonas: "...but I guess you can do it in Erlang..." It's that "guess" part, dude. :-) You're guessing that Erlang's process switching scales up past the thousands. You're guessing that it does so better than Java or OS threads. Guessing and software dev are not a great combination. :-) But I think I've made my point. — T.J. Crowder
@T.J. Crowder: Install Erlang, run erl +P 1000100 +hms 100, then type {_, PIDs} = timer:tc(lists,map,[fun(_)->spawn(fun()->receive stop -> ok end end) end, lists:seq(1,1000000)]). and wait about three minutes for the result. It's that simple. It takes 140us per process and 1GB of RAM in total on my laptop. But that is straight from the shell; it should be better from compiled code. — Hynek -Pichi- Vychodil
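
The shell measurement in the comment above can also be tried from compiled code. The following is only a minimal sketch: the module name spawn_bench and the function run/1 are made up for illustration, and the numbers it reports will depend on the machine, the Erlang/OTP release, and the VM flags.

```erlang
-module(spawn_bench).
-export([run/1]).

%% Spawn N processes that each block in receive, and return the
%% average wall-clock spawn time in microseconds per process.
%% N must stay below the VM's process limit (raise it with +P).
run(N) when is_integer(N), N > 0 ->
    Body = fun() -> receive stop -> ok end end,
    {Micros, Pids} =
        timer:tc(fun() -> [spawn(Body) || _ <- lists:seq(1, N)] end),
    %% Tell every spawned process to exit so repeated runs do not
    %% accumulate idle processes.
    lists:foreach(fun(Pid) -> Pid ! stop end, Pids),
    Micros / N.
```

To try it, compile the module in the shell with c(spawn_bench). and call spawn_bench:run(100000). For a million processes, start the VM with a higher process limit, as in the comment (erl +P 1000100).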

Marcelo Cantos · Accepted Answer · 2010-04-25T11:54:45

There are several contributing factors:

Erlang processes are not OS processes. They are implemented by the Erlang VM using a lightweight cooperative threading model (preemptive at the Erlang level, but under the control of a cooperatively scheduled runtime). This means that it is much cheaper to switch context, because they only switch at known, controlled points and therefore don't have to save the entire CPU state (normal, SSE and FPU registers, address space mapping, etc.).
Erlang processes use dynamically allocated stacks, which start very small and grow as necessary. This permits the spawning of many thousands — even millions — of Erlang processes without sucking up all available RAM.
Erlang used to be single-threaded, meaning that there was no requirement to ensure thread-safety between processes. It now supports SMP, but the interaction between Erlang processes on the same scheduler/core is still very lightweight (there are separate run queues per core).
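
Two of the factors above (small initial stacks and cheap communication) can be checked directly. This is a rough sketch rather than a rigorous benchmark: the module name proc_cost and its functions are hypothetical, and the results vary with the Erlang/OTP release, word size, and scheduler settings.

```erlang
-module(proc_cost).
-export([footprint/0, roundtrip/1]).

%% Memory in bytes held by a freshly spawned, idle process, as
%% reported by the VM (call stack, heap, and internal structures).
footprint() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    {memory, Bytes} = erlang:process_info(Pid, memory),
    Pid ! stop,
    Bytes.

%% Average time in microseconds for one ping/pong exchange
%% (two messages) between the caller and an echo process.
roundtrip(N) when is_integer(N), N > 0 ->
    Echo = spawn(fun echo/0),
    {Micros, ok} = timer:tc(fun() -> ping(Echo, N) end),
    Echo ! stop,
    Micros / N.

%% Reply pong to every ping until told to stop.
echo() ->
    receive
        {ping, From} ->
            From ! pong,
            echo();
        stop ->
            ok
    end.

%% Send N pings, waiting for each pong before sending the next.
ping(_Echo, 0) ->
    ok;
ping(Echo, N) ->
    Echo ! {ping, self()},
    receive pong -> ok end,
    ping(Echo, N - 1).
```

After c(proc_cost). in the shell, proc_cost:footprint() gives the per-process starting cost in bytes and proc_cost:roundtrip(100000) gives an average round-trip time. Comparing these against the stack size an OS thread reserves by default on the same machine is one way to test the claims rather than take them at face value.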
