3
votes

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:

  • Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
  • Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
  • Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).

Is any of the three options above, or some alternative, considered best practice and/or generally more efficient?

Many reductions use shared memory and compute one result per threadblock. The threadblock results then go through a second "global" reduction. It seems like something similar could be applied. It's not difficult to make it thread-safe and less costly if you can work with simply knowing whether a change has occurred, rather than the number of changes or which elements have changed. Then you can simply set a shared memory location to zero at first and let any thread in the block set it to 1, at any time, in any order. A similar approach can be used for the second-step global reduction. – Robert Crovella
Can you explain exactly what the actual stability or convergence criterion you need is? – talonmies
@talonmies No modifications to the matrix in the last iteration. – user2398029
@RobertCrovella similar idea to 3), which is what I'm doing right now (albeit directly on the matrix instead of on a "flags" matrix). I just can't wrap my head around the fact that such a "complex" process is needed when in the best case scenario a boolean flag would have to be modified once. – user2398029
@louism: I had actually written you an answer for this some weeks ago, including some demo code, but a browser crash ate it, sorry. I would do a sum-reduce, but you can do it efficiently using warp voting primitives (the __any() warp vote, for example). Then you only need a very simple reduction for the result of each warp within a block, and a single atomic add per block to update a global flag. If the flag is in zero-copy memory, then you don't need an explicit copy to inspect the result on the host. – talonmies

2 Answers

4
votes

I'll also go back to the answer I would have posted in 2012 but for a browser crash.

The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.

This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.

A complete working example (with C++ host code; I don't have access to a working PyCUDA installation at the moment) looks like this:

#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>

__device__ unsigned int process(int & val)
{
    return (++val < 10);
}

template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    // Do calculations then check for change/convergence 
    // and set tchanged to be !=0 if required
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Simple blockwise reduction using voting primitives
    // increments kchanged if any thread in the block
    // returned tchanged != 0
    tchanged = __any(tchanged != 0);
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for(int i=0; i<nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}

int main(void)
{
    const int N = 2048;
    const int min = 5, max = 15;
    std::vector<int> data(N);
    for(int i=0; i<N; i++) {
        data[i] = min + (std::rand() % (int)(max - min + 1));
    }

    int* _data;
    size_t datasz = sizeof(int) * (size_t)N;
    cudaMalloc<int>(&_data, datasz);
    cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);

    unsigned int *kchanged, *_kchanged;
    cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);

    const int nwarps = 4;
    dim3 blcksz(32*nwarps), grdsz(16);

    // Loop while the kernel signals it needs to run again
    do {
        *kchanged = 0;
        kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
        cudaDeviceSynchronize(); 
    } while (*kchanged != 0); 

    cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    int minval = *std::min_element(data.begin(), data.end());
    assert(minval == 10);

    return 0;
}

Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented above a threshold value. At the end of its processing, each thread participates in a warp vote, after which one thread from each warp writes the vote result to shared memory. One thread per block then reduces the per-warp results and atomically updates the kchanged value. The host thread waits until the device is finished and can then read the result directly from the mapped host variable.
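One caveat if you are on CUDA 9 or later: the implicit warp-synchronous __any() intrinsic used above is deprecated there in favour of __any_sync(). A minimal sketch of the same kernel rewritten with the newer primitive, assuming (as in the example above) that the block size is a multiple of warpSize so every warp is full, might look like this:

// Same kernel as above, rewritten with the CUDA 9+ *_sync warp primitives.
// Assumes blockDim.x is a multiple of warpSize, so 0xffffffff is a valid
// participation mask for every warp in the block.
template<int nwarps>
__global__ void kernel_sync(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Warp-wide vote with an explicit full-warp mask
    tchanged = __any_sync(0xffffffffu, tchanged != 0);
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    // Thread 0 combines the per-warp votes and updates the global flag
    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for (int i = 0; i < nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}

The host code and the rest of the scheme stay exactly the same; only the vote itself needs the explicit participation mask.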

You should be able to adapt this to whatever your application requires.

3
votes

I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.

create a flag in global memory:

__device__ int flag;

at each iteration,

  1. initialize the flag to zero (in host code):

    int init_val = 0;
    cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
    
  2. In your kernel device code, modify the flag to 1 if a change is made to the matrix:

    __global__ void iter_kernel(float *matrix){
    
    ...
      if (new_val[i] != matrix[i]){
        matrix[i] = new_val[i];
        flag = 1;}
    ...
    }
    
  3. after calling the kernel, at the end of the iteration (in host code), test for modification:

    int modified = 0;
    cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
    if (modified){
      ...
      }
    

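Putting the three steps together, a minimal sketch of the host-side driver loop might look like the following (the device pointer d_matrix, the matrix size N, and the launch configuration are illustrative placeholders, not part of the code above):

// Hypothetical driver loop combining steps 1-3; d_matrix and N are placeholders
int modified = 1;
while (modified) {
    // 1. reset the flag before the iteration
    int init_val = 0;
    cudaMemcpyToSymbol(flag, &init_val, sizeof(int));

    // 2. run one iteration of the algorithm
    iter_kernel<<<(N + 255) / 256, 256>>>(d_matrix);

    // 3. read the flag back; this blocking copy also waits for the kernel to finish
    cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
}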
Even if multiple threads, in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.

Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
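For instance, a hedged sketch of a tolerance-based test (the helper name and tolerance value are purely illustrative) could replace the exact comparison in the kernel above:

// Illustrative device helper: treat changes below a relative tolerance
// as "no modification" instead of testing exact floating point equality
__device__ bool significantly_different(float old_val, float new_val)
{
    const float tol = 1e-6f;
    return fabsf(new_val - old_val) > tol * fmaxf(fabsf(old_val), 1.0f);
}

so that if (significantly_different(matrix[i], new_val[i])) takes the place of if (new_val[i] != matrix[i]).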

Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
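A hedged sketch of the shared-memory, per-block flag variant just described might look like this (the kernel name and bounds check are illustrative; flag is the same __device__ int as above):

// Per-block flag in shared memory: at most one global write per block
__global__ void iter_kernel_blockflag(float *matrix, const float *new_val, int n)
{
    __shared__ int bchanged;
    if (threadIdx.x == 0) bchanged = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && new_val[i] != matrix[i]) {
        matrix[i] = new_val[i];
        bchanged = 1;      // many threads may write 1; all write the same value
    }
    __syncthreads();

    // one thread per block updates the global flag, and only if needed
    if (threadIdx.x == 0 && bchanged) {
        flag = 1;
    }
}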

I can't see any reason to use atomics, or how it would benefit anything.

A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".