6
votes

Microsoft's documentation of Parallel.For contains the following method:

static void MultiplyMatricesParallel(double[,] matA, double[,] matB, double[,] result)
{
    int matACols = matA.GetLength(1);
    int matBCols = matB.GetLength(1);
    int matARows = matA.GetLength(0);

    // A basic matrix multiplication.
    // Parallelize the outer loop to partition the source array by rows.
    Parallel.For(0, matARows, i =>
    {
        for (int j = 0; j < matBCols; j++)
        {
            double temp = 0;
            for (int k = 0; k < matACols; k++)
            {
                temp += matA[i, k] * matB[k, j];
            }
            result[i, j] = temp;
        }
    }); // Parallel.For
}

In this method, potentially multiple threads read values from matA and matB, which were both created and initialized on the calling thread, and potentially multiple threads write values to result, which is later read by the calling thread. Within the lambda passed to Parallel.For, there is no explicit locking around the array reads and writes. Because this example comes from Microsoft, I assume it's thread-safe, but I'm trying to understand what's going on behind the scenes to make it thread-safe.

To the best of my understanding from what I've read and other questions I've asked on SO (for example this one), several memory barriers are needed to make this all work. Those are:

  1. a memory barrier on the calling thread after creating and initializing matA and matB,
  2. a memory barrier on each non-calling thread before reading values from matA and matB,
  3. a memory barrier on each non-calling thread after writing values to result, and
  4. a memory barrier on the calling thread before reading values from result.
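If it helps to visualize, here is a minimal sketch (my own illustration, not anything Parallel.For requires you to write) of where those four conceptual barriers would sit if you hand-rolled the threading with explicit Thread.MemoryBarrier calls. In practice Thread.Start and Thread.Join already imply these fences:

```csharp
using System;
using System.Threading;

class BarrierSketch
{
    static double[] data = new double[4];
    static double[] result = new double[4];

    public static void Main()
    {
        for (int i = 0; i < data.Length; i++) data[i] = i + 1;
        Thread.MemoryBarrier();                 // 1: publish data before handing off

        var workers = new Thread[data.Length];
        for (int i = 0; i < workers.Length; i++)
        {
            int row = i;                        // capture a distinct index per thread
            workers[i] = new Thread(() =>
            {
                Thread.MemoryBarrier();         // 2: acquire a fresh view of data
                result[row] = data[row] * 2;
                Thread.MemoryBarrier();         // 3: publish the write to result
            });
            workers[i].Start();                 // Start itself implies a fence
        }
        foreach (var t in workers) t.Join();    // Join itself implies a fence
        Thread.MemoryBarrier();                 // 4: acquire a fresh view of result

        Console.WriteLine(string.Join(",", result)); // prints "2,4,6,8"
    }
}
```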

Have I understood this correctly?

If so, does Parallel.For do all of that somehow? I went digging in the reference source but had trouble following the code. I didn't see any lock blocks or MemoryBarrier calls.

I expect that Parallel.For() has a memory barrier at the start and end of it. Have you looked at the source code for Parallel.For()? – Ian Ringrose

@IanRingrose, yes, as mentioned and linked to in my question. I didn't find any MemoryBarrier calls or lock blocks. – adv12

But what about in the methods it calls? I expect it sits on top of the task system, and that somewhere in the task system there are memory barriers where tasks start and end. – Ian Ringrose

Yes, I expect so. Haven't fully traced the code yet. – adv12

@HansPassant I do trust that MS gets it right, but I've written threaded code that is wrong, and understanding exactly what is required to get it right helps me when I have to do it myself, and more importantly helps me know exactly what I can count on MS to do for me vs. what locking, etc. I still need to implement when using Microsoft's patterns and libraries. – adv12

4 Answers

6
votes

Since the arrays are already created, reading from or writing to them will never cause a resize. Also, the code itself guarantees that no two iterations ever write to the same position in the array.

The bottom line is that each thread can always calculate exactly which positions it reads and writes, and those writes never overlap. Hence, it is thread-safe.
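As a toy illustration of that disjoint-partitioning argument (a hypothetical example, not from the original post): each iteration i "owns" row i of the output, so no element is ever written by two threads:

```csharp
using System;
using System.Threading.Tasks;

class DisjointWrites
{
    public static void Main()
    {
        // Each iteration i writes only row i of result; no element is
        // ever touched by two iterations, so no locking is needed.
        var result = new int[4, 3];
        Parallel.For(0, 4, i =>
        {
            for (int j = 0; j < 3; j++)
                result[i, j] = i * 10 + j;
        });
        Console.WriteLine(result[2, 1]); // prints "21"
    }
}
```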

6
votes

The memory barriers you are looking for are inside the task scheduler.

Parallel.For breaks the work up into tasks, and a work-stealing scheduler executes those tasks. The minimal memory barriers required for a work-stealing scheduler are:

  1. A "release" fence after a task is created.
  2. An "acquire" fence when a task is stolen.
  3. A "release" fence when a stolen task completes.
  4. An "acquire" fence for the thread that is waiting on the task.

Look here to see where 1 is implied by the atomic ("Interlocked") operations used to enqueue a task. Look here to see where 2 is implied by atomic operations, volatile reads, and/or locks when a task is stolen.

I have not been able to track down where 3 and 4 are. 3 and 4 are likely implied by some kind of atomic join counter.
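The release/acquire pairing described above can be sketched with a toy handoff (my own illustration, not the actual TPL scheduler code). A volatile write has release semantics and a volatile read has acquire semantics in .NET, which is the same guarantee the scheduler's Interlocked operations rely on:

```csharp
using System;
using System.Threading;

class HandoffSketch
{
    static int payload;
    static volatile bool ready;   // volatile write = release, volatile read = acquire

    public static void Main()
    {
        var worker = new Thread(() =>
        {
            while (!ready) Thread.Yield(); // acquire: once true, payload is visible
            Console.WriteLine(payload);    // prints "42"
        });
        worker.Start();

        payload = 42;   // ordinary write...
        ready = true;   // ...published by the release that the volatile write provides
        worker.Join();
    }
}
```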

3
votes

Inside the threads (actually: tasks), access to matA and matB is read-only, and result is write-only.

Parallel reading is inherently thread-safe; the writing is thread-safe because the i variable is unique to each task.

There is no need for memory barriers in this piece of code (other than before/after the entire Parallel.For, but those can be assumed).

From your numbered items:
1) and 4) are implied by the Parallel.For();
2) and 3) are simply not needed.

2
votes

I think you're really taken with the idea of memory barriers, but I can't quite understand your concerns. Let's take a look at the code you've investigated:

  1. Three arrays are created and filled with values on the main thread. This is like assigning a value to a variable and then calling a method: the CLR ensures that the method sees fresh values for its arguments. A possible issue could arise if the initialization were done in the background and/or concurrently by some other thread. In that case you'd be right, and you would need some synchronization construct here: a memory barrier, a lock statement, or another technique.

  2. The code for parallel execution takes all values from 0 to matARows and creates a task for each of them. You need to understand two different ways to parallelize code: by operations and by data. Here we have multiple rows with the same lambda operation applied to each. The temp variable is not shared between iterations, so its assignments are thread-safe and no memory barrier is needed, as there are no stale values to observe. Again, as in the first point, if some other thread updated the initial matrices, you would need synchronization constructs here.

  3. Parallel.For ensures that all the tasks are done (ran to completion, were cancelled, or faulted) before it proceeds to the next statement, so the code after the loop executes as in a normal method. Why don't you need a barrier here? Because all the write operations are done on different rows and there is no intersection between them; this is data parallelism. However, as in the other cases, if another thread needed a new value from some loop iteration, you would still need synchronization. So this code is thread-safe because it is geometrically parallel over its data and doesn't create race conditions.
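That completion guarantee is easy to observe (a hypothetical demo of my own, not from the answer): reads performed after Parallel.For returns always see every iteration's writes:

```csharp
using System;
using System.Threading.Tasks;

class CompletionDemo
{
    public static void Main()
    {
        var done = new bool[100];
        // Parallel.For blocks until every iteration has finished,
        // so reads after it are safe without extra synchronization.
        Parallel.For(0, done.Length, i => done[i] = true);
        Console.WriteLine(Array.TrueForAll(done, d => d)); // prints "True"
    }
}
```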

This example is very simple, and a real algorithm in general needs more complicated logic. You can investigate various methods for proving code to be thread-safe without using synchronization, making the code lock-free.