How to pack bits (efficiently) in CUDA?

Question

I have an array of bytes where each byte is either 0 or 1. Now I want to pack these values into bits, so that 8 original bytes occupy 1 target byte, with original byte 0 going into bit 0, byte 1 into bit 1, etc. So far I have the following in the kernel:

const uint16_t tid = threadIdx.x;
__shared__ uint8_t packing[cBlockSize];

// ... Computation of the original bytes in packing[tid]
__syncthreads();

if ((tid & 4) == 0)
{
    packing[tid] |= packing[tid | 4] << 4;
}
if ((tid & 6) == 0)
{
    packing[tid] |= packing[tid | 2] << 2;
}
if ((tid & 7) == 0)
{
    pOutput[(tid + blockDim.x*blockIdx.x)>>3] = packing[tid] | (packing[tid | 1] << 1);
}

Is this correct and efficient?

This can't ever work. That is a memory race. There is not such thing as parallel bit sized transactions in CUDA — talonmies
@talonmies, I thought there is no race because threads handling the same byte belong to the same warp. — Serge Rogatch
Being in the same warp is no guarantee of safety, No two threads can modify the same byte simultaneously without causing a race — talonmies
I meant that they handle the same byte sequentially between ifs, but they don't modify the same byte in parallel. — Serge Rogatch

tera tera · Accepted Answer · 2016-09-14T10:57:20

The __ballot() warp-voting function comes quite handy for this. Assuming that you can redefine pOutput to be of uint32_t type, and that your block size is a multiple of the warp size (32):

unsigned int target = __ballot(packing[tid]);
if (tid % warpSize == 0) {
    pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
}

Strictly speaking, the if conditional isn't even necessary, as all threads of the warp will write the same data to the same address. So a highly optimized version would just be

pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);

How to pack bits (efficiently) in CUDA?

2 Answers