5
votes

One of the difficulties in writing algorithms or data structures that satisfy lock-free progress guarantees is dynamic memory allocation: calling something like malloc or new isn't guaranteed to be lock-free in a portable manner. However, many lock-free implementations of malloc or new exist, and there are also a variety of lock-free memory allocators that can be used to implement lock-free algorithms/data-structures.

However, I still don't understand how this can actually completely satisfy lock-free progress guarantees, unless you specifically limit your data-structure or algorithm to some pre-allocated static memory pool. But if you need dynamic memory allocation, I don't understand how any alleged lock-free memory allocator can ever be truly lock-free in the long-run. The problem is that no matter how amazing your lock-free malloc or new might be, ultimately you may run out of memory, at which point you have to fall back on asking the operating system for more memory. That means that ultimately you have to call brk() or mmap() or some such low-level equivalent to actually get access to more memory. And there is simply no guarantee that any of these low-level calls are implemented in a lock-free manner.

There's simply no way around this, (unless you're using an ancient OS like MS-DOS that doesn't provide memory protection, or you write your own completely lock-free operating system - two scenarios that are not practical or likely.) So, how can any dynamic memory allocator truly be lock-free?

1
You're right, but... Since the allocator only requests large blocks from the OS, and is not required to release them in a timely manner, the total number of calls to sbrk() or whatever can be strictly limited to the point where it doesn't matter. This doesn't satisfy hard real-time requirements, of course, but I don't think dynamic allocation would satisfy such requirements even if it was completely lock-free.Matt Timmermans
Are you worried about latency, or are you worried about using a en.wikipedia.org/wiki/Non-blocking_algorithm? If the latter, I think most good OSes make it impossible for a thread to sleep inside the kernel while holding a lock that would block other threads from using mmap. So as far as writing a non-blocking algorithm, I don't think it's technically a problem to use mmap, because locks are only a problem if it's possible for a thread to sleep while holding one. A couple CPU cache misses inside the critical section are probably the worst you'd get.Peter Cordes
All allocators running on a real system have to run out of memory at some point; a lock-free allocator running out when its static pool is allocated isn't qualitatively different to a normal allocator running out when the OS decides you've had enough.James Picone
@PeterCordes: But if you have a dozen cores in your system, and all are busy allocating memory [you probably should consider your algorithm in that case, but never mind that], and they are all trying to get hold of the same lock, they will all have to pass through single-file through the lock. Whilst it may not cause the process to SLEEP, it will cause it to WAIT. Which is about the same thing.Mats Petersson
@MatsPetersson: Yeah it's obviously not wait-free, but you could consider it technically lock-free and obstruction-free. The latter just require that at least one thread can make forward progress at any given time. In practice those Computer Science terms aren't what you care about, and you actually care about scalability when there's contention. Thus, lockless algorithms (using atomics) are often a win, even if they aren't lock-free (e.g. a lockless queue, good analysis of the difference here: stackoverflow.com/a/45910129/224132).Peter Cordes

1 Answers

7
votes

As you have found yourself, the fundamental OS allocator is most likely not lock-free, because it has to deal with multiple processes and all manner of interesting stuff that makes it really hard to not introduce some sort of lock.

For some cases, however, the "lock free memory allocation" doesn't mean "never locks", but "statistically locks so rarely that it doesn't really matter". Which is fine for anything but the most strict real-time systems. If you don't have high contention on your lock, then lock or no lock doesn't really matter - the purpose of lock-free is not really the overhead of the lock itself, but the ease with which it becomes a bottle-neck where every thread or process in the system has to pass through this one place to do anything useful, and as it does so, it has to wait in queue [it may not be a true queue either, it may be "whoever wakes first" or some other mechanism that decides who comes out next, after the current caller].

There are a few different options to solve this:

  • If you have a memory pool with a finite size, you can ask the OS for all that memory at once, when the software is being started. And after the memory has been chunked out from the OS, it can be used as a lock-free pool. The obvious drawback is that it has a limit to how much memory can be allocated. You then either have to stop allocating (fail the application alltogether, or fail that particular operation).

    Of course, in a system like Linux or Windows, there is still no guarantee that memory allocation, in a lock-free scenario, means "instant access to the allocated memory", since the system can and will allocate memory without actual physical memory backing to it, and only once the memory is actually being used, the physical memory page is assigned to it. This may both involve locks and for example disk-I/O to page out some other page to the swap.

  • For such strict real-time systems that the time of a single system call that may contend for a lock is "too much", the solution is of course to use a dedicated OS, one that has a lock-free allocator inside the OS (or at least one that has a known real-time behaviour that is acceptable - it locks for at most X microsecons [X can be less than 1.0]). Real-time systems often have a pool of memory and fixed size buckets for recycling old allocations, which can be done in a lock-free manner - the buckets are a linked list, so you can insert/remove from that list with atomic compare&exchange operations [possibly with retry, so although it's technically lock-free, it's not zero wait time in contended situations].

  • Another solution that can work is to have "per thread pools". This can get a bit complicated if you pass data between threads, but if you either accept that "memory freed for reuse may end up in a different thread" (which of course leads to problems along the lines of "all the memory now sits in that one thread that collects and frees the information from many other threads, and all other threads have run out of memory")