0 votes

I searched for an answer, and I know at a high level how to use lock etc. in a multithreaded environment. This question has bugged me for a long time, and I think I am not the only one. TL;DR at the end.

In my case I want to prevent a method that gets called from multiple threads from executing while another thread is already executing it.

Now a normal lock scenario would look like this in C#:

static readonly object _locker = new object();

private static int _counter;

public void Increase()
{
  lock (_locker)
  {
    _counter++; // Do more than this here
  }
}
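
(For reference, the compiler expands the lock statement above into Monitor calls wrapped in try/finally, roughly like this:)

bool lockTaken = false;
try
{
  Monitor.Enter(_locker, ref lockTaken); // System.Threading.Monitor
  _counter++; // Do more than this here
}
finally
{
  if (lockTaken) Monitor.Exit(_locker);
}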

As I understand it, the object _locker acts like a bool that indicates whether the method is currently being executed. If the method is "free", set it to locked, execute the method, and free it afterwards. If the method is locked, wait until it is unlocked, then lock, execute, and unlock.

Side question 1: Does calling this method repeatedly guarantee queue-like behavior? Ignore the fact that the blocking in the parent thread could cause problems; imagine Increase() is the last call in the parent thread.

Side question 2: Using an object in a boolean way feels odd. Does every object contain an "is-being-used" flag, and is a "raw" object just used as a container for this flag? Why not a Boolean?

Side question 3: How can lock() modify a readonly field?

Functionally, I could also write it like this:

static bool _locker = false;

private static int _counter;

public void Increase()
{
  while (_locker) // Wait until lock == 0
  {
  }

  _locker = true; // Set lock = 1

  _counter++; // Do more than this here

  _locker = false; // Set lock = 0
}

While the lock example looks sophisticated and safe, the second one just feels wrong, and somehow rings alarm bells in my head.

Is it possible for this method to be executed at the exact same CPU cycle by two cores simultaneously?

I know this is "but sometimes" taken to the extreme. I believe an OS scheduler does distribute threads from one application across multiple cores, so why shouldn't the assembly instruction "load the value of _locker for comparison" be executed on two cores at the same time? Even if the method is entered one cycle apart, the "read for comparison" on one core and the "write true to _locker" on the other would be executed at the same time.

This doesn't even take into account that one line of C# can translate to multiple assembly instructions, so a thread could be interrupted after confirming _locker == false and before writing _locker = true. And since one line of C# can result in many assembly instructions, couldn't even the lock() itself be interrupted?
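
Spelled out, the bad interleaving in the flag version looks like this:

// Thread A: reads _locker == false  -> exits the while loop
// Thread B: reads _locker == false  -> also exits the while loop
// Thread A: _locker = true; _counter++;
// Thread B: _locker = true; _counter++;  // both inside the "critical section" at once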

Obviously these problems are somehow solved or avoided. I would really appreciate an explanation of where my thought process goes wrong or what I am missing.

TL;DR: Can a lock() statement be executed at the exact same time by two CPU cores? I can't see how software could prevent that without a big performance impact.

Your second code snippet is flawed and will have concurrency issues. Use a lock(). Uncontended locks are sufficiently fast; don't assume you have a performance problem until you have measured one. (Mitch Wheat)
TL;DR: locks use CPU features to ensure that only one thread can enter the lock at a time. Objects contain whatever internal runtime state is required to implement this feature; you don't need to know what that is. Why object? Because that's how Java did it. (Jeremy Lakeman)
lock() is guaranteed to be atomic (it can't be interrupted between checking whether it's locked and setting the lock state), but your while loop isn't. Feedback on your question: don't ask multiple questions in one post. (Sweeper)

1 Answer

3 votes

Yes, two cores can take two different locks at the same time. The atomic RMW operation only needs a "cache lock", not a global bus lock, on modern CPUs. For example, this test code (on Godbolt) is C++ that compiles to a loop repeating xchg [rdi], ecx, with each thread using a different std::atomic<int> object in a different cache line.

The total runtime of the program on my i7-6700k is 463 ms whether it runs on 1 or 4 threads. That rules out any kind of system-wide bus lock, confirming that the CPU just uses a MESI cache lock within the core doing the RMW to make it atomic without disturbing the operations of other cores. Uncontended locks scale perfectly when each thread only locks and unlocks its own lock repeatedly.

Taking a lock that was last released by another core will stall this core for maybe hundreds of clock cycles (40 to 70 nanoseconds is a typical inter-core latency) while the RFO (Read For Ownership) completes and gets exclusive ownership of the cache line, but it won't have to retry or anything. An atomic RMW also acts as a full memory barrier (on x86), so memory operations after the lock can't even get started, and the core may be stalled for a while. This is a significant cost compared to normal loads and stores, and one that out-of-order exec can't hide as well as it hides some other things.


No, two cores can't take the same lock at the same time¹; that's the whole point of a mutex. Correctly implemented ones don't have the bug in your example: spin-waiting and then separately storing true.

(Note 1: there are counted locks / semaphores you can use to allow up to n threads into a critical section, for some fixed n, when the resource-management problem you want to solve is something other than simple mutual exclusion. But you're only asking about mutexes.)


The critical operation in taking a lock is an atomic RMW, for example x86 xchg [rcx], eax or lock cmpxchg [rcx], edx, which stores a 1 (true) and, as part of the same operation, checks what the old value was. (See Can num++ be atomic for 'int num'?.) In C++, that would mean std::atomic<bool> lock; / old = lock.exchange(true);. In C#, you have Interlocked.Exchange(). That closes the race window in your attempt, where two threads could both exit the while(_locker){} loop and then both blindly store _locker = true.
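
To make that concrete, here is a minimal sketch of a correct spin lock built on Interlocked.Exchange (the field and class names are just illustrative):

using System.Threading;

class Counter
{
  private static int _lockTaken; // 0 = free, 1 = held; int because Interlocked works on ints

  private static int _counter;

  public void Increase()
  {
    // Atomically store 1 and get the old value back in a single RMW.
    // Only a thread that saw 0 has actually acquired the lock.
    while (Interlocked.Exchange(ref _lockTaken, 1) == 1)
    {
      // Someone else holds the lock; keep retrying.
    }

    _counter++; // Do more than this here

    Volatile.Write(ref _lockTaken, 0); // Release the lock
  }
}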

Also note that rolling your own spin loop has problems if you don't use volatile or Volatile.Read() to stop the compiler from assuming that no other thread writes the variable you're reading. (Without volatile, while(foo){} can be optimized into if(foo) infinite_loop(); by hoisting the apparently loop-invariant load out of the loop.)
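
In C# terms, the transformation the JIT is allowed to make on the second snippet looks roughly like this (a sketch, not guaranteed JIT output):

// What you wrote:
while (_locker) { }

// What the JIT may legally produce, since nothing tells it that
// another thread can change _locker mid-loop:
if (_locker)
{
  while (true) { }
}

// Safe version: force a fresh read from memory on every iteration.
while (Volatile.Read(ref _locker)) { }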

(The other interesting part of implementing a lock is what to do if it isn't available the first time you try: how long to keep spinning (and exactly how, e.g. the x86 pause instruction between read-only checks), using CPU time while waiting, before falling back to a system call that gives up the CPU to another thread or process and lets the OS wake you back up when the lock is, or might be, available again. But that's all performance tuning; actually taking the lock revolves around an atomic RMW.)
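
In C#, the SpinWait struct packages exactly that spin-then-yield policy; here is a sketch of combining it with the exchange-based lock above (names again illustrative):

using System.Threading;

static int _lockTaken;

static void Acquire()
{
  var spinner = new SpinWait();
  while (Interlocked.Exchange(ref _lockTaken, 1) == 1)
  {
    // Spins briefly (pause-style) at first, then starts yielding
    // the CPU to the OS once spinning looks unproductive.
    spinner.SpinOnce();
  }
}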


Of course, if you're going to roll any of this yourself, make the increment itself a lock-free atomic RMW with Interlocked.Increment(ref counter);, as per the example in MS's docs.
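
For a plain counter that is all you need; no mutex at all:

using System.Threading;

private static int _counter;

public void Increase()
{
  // One atomic read-modify-write (lock inc / lock xadd on x86);
  // safe to call from any number of threads concurrently.
  Interlocked.Increment(ref _counter);
}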


Does every object contain an "is-being-used" flag, and is a "raw" object just used as a container for this flag? Why not a Boolean?

We know from object sizes that C# doesn't do that. Probably you should just use lock (counter){ counter++; } instead of inventing a separate lock object. Using a dummy object would make sense if you didn't have an existing object you wanted to manage, but instead some more abstract resource, like calling into some function. (Correct me if I'm wrong; I don't use C#, I'm just here for the cpu-architecture and assembly tags. Does lock() require an object, not a primitive type like int?)

I'd guess that C# instead does what typical C++ implementations of std::atomic<T> do for objects too large to be lock-free: keep a hash table of actual mutexes or spinlocks, indexed by object address. See Where is the lock for a std::atomic?.

Even if that guess isn't exactly what C# does, it's the kind of mental model that can make sense of this ability to lock anything, without reserving space in every object.
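
Purely as an illustration of that mental model (this is not what the runtime actually does; LockPool and its pool size are invented for this example):

using System.Runtime.CompilerServices;

static class LockPool
{
  // Fixed pool of lock objects; 64 is an arbitrary power of two.
  private static readonly object[] _locks = NewPool(64);

  private static object[] NewPool(int n)
  {
    var pool = new object[n];
    for (int i = 0; i < n; i++) pool[i] = new object();
    return pool;
  }

  public static object For(object target)
  {
    // The identity hash is stable for an object's lifetime, so the same
    // object always maps to the same pooled lock. Unrelated objects can
    // collide on a slot, which is the extra contention mentioned below.
    int index = RuntimeHelpers.GetHashCode(target) & (_locks.Length - 1);
    return _locks[index];
  }
}

With this, lock (LockPool.For(obj)) { ... } serializes all threads locking obj, at the cost of occasionally also serializing threads that lock an unrelated object hashing to the same slot.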

This can create extra contention (by using the same mutex for two different objects). It could even introduce deadlocks where there shouldn't have been any, which is something the implementation would have to work around, perhaps by putting the identity of the object being locked into the mutex, so that another thread which indexes the same mutex can see it's actually being used to lock a different object, and then do something about it... This is perhaps where being a "managed" language comes in: Java apparently does the same thing, letting you lock any object without having to define a separate lock.

(C++ std::atomic doesn't have this problem, because the mutexes are taken and released inside library functions, with no possibility of trying to take two locks at the same time.)


Do CPU cores tick at the same time?

Not necessarily; e.g. Intel "server" chips (most Xeons) let each core control its frequency multiplier independently. However, even in a multi-socket system, the clocks for all cores are normally derived from the same source, so they can keep their TSCs (which count reference cycles, not core clocks) synced across cores.

Intel "client" chips, like desktop/laptop chips such as i7-6700, actually do use the same clock for all cores. A core is either in low-power sleep (clock halted) or running at the same clock frequency as any other active cores.

None of this has anything to do with locking or with making atomic RMW operations truly atomic, and it should probably be split off into a separate Q&A. I'm sure there are plenty of non-x86 examples, but I happen to know how Intel CPUs do things.