3
votes

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:

  • Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
  • Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
  • Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).

Is any of the three options above, or some alternative, considered best practice and/or generally more efficient?

Many reductions use shared memory and compute one result per threadblock. The threadblock results then go through a second "global" reduction. It seems like something similar could be applied. It's not difficult to make it thread-safe and less costly if you can work with simply knowing whether a change has occurred, rather than the number of changes or which elements have changed. Then you can simply set a shared memory location to zero at first and let any thread in the block set it to 1, at any time, in any order. A similar approach can be used for the second-step global reduction. – Robert Crovella
Can you explain exactly what the actual stability or convergence criterion you need is? – talonmies
@talonmies No modifications to the matrix in the last iteration. – user2398029
@RobertCrovella similar idea to 3), which is what I'm doing right now (albeit directly on the matrix instead of on a "flags" matrix). I just can't wrap my head around the fact that such a "complex" process is needed when in the best case scenario a boolean flag would have to be modified once. – user2398029
@louism: I had actually written you an answer for this some weeks ago, including some demo code, but a browser crash ate it, sorry. I would do a sum-reduce, but you can do it efficiently using warp voting primitives (the __any() warp vote, for example). Then you only need a very simple reduction for the result of each warp within a block, and a single atomic add per block to update a global flag. If the flag is in zero-copy memory, then you don't need an explicit copy to inspect the result on the host. – talonmies

2 Answers

4
votes

I'll also go back to the answer I would have posted in 2012 but for a browser crash.

The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.

This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.

A complete working example (with C++ host code; I don't have access to a working PyCUDA installation at the moment) looks like this:

#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>

__device__ unsigned int process(int & val)
{
    return (++val < 10);
}

template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    // Do calculations then check for change/convergence 
    // and set tchanged to be !=0 if required
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Simple blockwise reduction using voting primitives
    // increments kchanged if any thread in the block
    // returned tchanged != 0
    tchanged = __any(tchanged != 0);
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for(int i=0; i<nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}

int main(void)
{
    const int N = 2048;
    const int min = 5, max = 15;
    std::vector<int> data(N);
    for(int i=0; i<N; i++) {
        data[i] = min + (std::rand() % (int)(max - min + 1));
    }

    int* _data;
    size_t datasz = sizeof(int) * (size_t)N;
    cudaMalloc<int>(&_data, datasz);
    cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);

    unsigned int *kchanged, *_kchanged;
    cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);

    const int nwarps = 4;
    dim3 blcksz(32*nwarps), grdsz(16);

    // Loop while the kernel signals it needs to run again
    do {
        *kchanged = 0;
        kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
        cudaDeviceSynchronize(); 
    } while (*kchanged != 0); 

    cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    int minval = *std::min_element(data.begin(), data.end());
    assert(minval == 10);

    return 0;
}

Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented above a threshold value. At the end of its processing, each thread participates in a warp vote, after which one thread from each warp writes the vote result to shared memory. One thread per block then reduces the per-warp results and atomically updates the kchanged value. The host thread waits until the device is finished and can then read the result directly from the mapped host variable.
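One caveat if you are on CUDA 9 or later: the implicit warp-synchronous __any() intrinsic used above is deprecated there in favour of __any_sync(). A minimal sketch of the same kernel rewritten with the newer primitive, assuming (as in the example above) that the block size is a multiple of warpSize so every warp is full, might look like this:

// Same kernel as above, rewritten with the CUDA 9+ *_sync warp primitives.
// Assumes blockDim.x is a multiple of warpSize, so 0xffffffff is a valid
// participation mask for every warp in the block.
template<int nwarps>
__global__ void kernel_sync(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Warp-wide vote with an explicit full-warp mask
    tchanged = __any_sync(0xffffffffu, tchanged != 0);
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    // Thread 0 combines the per-warp votes and updates the global flag
    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for (int i = 0; i < nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}

The host code and the rest of the scheme stay exactly the same; only the vote itself needs the explicit participation mask.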

You should be able to adapt this to whatever your application requires.

3
votes

I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.

create a flag in global memory:

__device__ int flag;

at each iteration,

  1. initialize the flag to zero (in host code):

    int init_val = 0;
    cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
    
  2. In your kernel device code, modify the flag to 1 if a change is made to the matrix:

    __global__ void iter_kernel(float *matrix){
    
    ...
      if (new_val[i] != matrix[i]){
        matrix[i] = new_val[i];
        flag = 1;}
    ...
    }
    
  3. after calling the kernel, at the end of the iteration (in host code), test for modification:

    int modified = 0;
    cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
    if (modified){
      ...
      }
    

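Putting the three steps together, a minimal sketch of the host-side driver loop might look like the following (the device pointer d_matrix, the matrix size N, and the launch configuration are illustrative placeholders, not part of the code above):

// Hypothetical driver loop combining steps 1-3; d_matrix and N are placeholders
int modified = 1;
while (modified) {
    // 1. reset the flag before the iteration
    int init_val = 0;
    cudaMemcpyToSymbol(flag, &init_val, sizeof(int));

    // 2. run one iteration of the algorithm
    iter_kernel<<<(N + 255) / 256, 256>>>(d_matrix);

    // 3. read the flag back; this blocking copy also waits for the kernel to finish
    cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
}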
Even if multiple threads, in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.

Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
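For instance, a hedged sketch of a tolerance-based test (the helper name and tolerance value are purely illustrative) could replace the exact comparison in the kernel above:

// Illustrative device helper: treat changes below a relative tolerance
// as "no modification" instead of testing exact floating point equality
__device__ bool significantly_different(float old_val, float new_val)
{
    const float tol = 1e-6f;
    return fabsf(new_val - old_val) > tol * fmaxf(fabsf(old_val), 1.0f);
}

so that if (significantly_different(matrix[i], new_val[i])) takes the place of if (new_val[i] != matrix[i]).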

Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
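A hedged sketch of the shared-memory, per-block flag variant just described might look like this (the kernel name and bounds check are illustrative; flag is the same __device__ int as above):

// Per-block flag in shared memory: at most one global write per block
__global__ void iter_kernel_blockflag(float *matrix, const float *new_val, int n)
{
    __shared__ int bchanged;
    if (threadIdx.x == 0) bchanged = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && new_val[i] != matrix[i]) {
        matrix[i] = new_val[i];
        bchanged = 1;      // many threads may write 1; all write the same value
    }
    __syncthreads();

    // one thread per block updates the global flag, and only if needed
    if (threadIdx.x == 0 && bchanged) {
        flag = 1;
    }
}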

I can't see any reason to use atomics, or how it would benefit anything.

A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".