Cuda efficiently copy from byte array to shared memory elements of different sizes

Question

I have a byte array (unsigned char *) that represents a tree like datastructure in memory. Each node of the tree contains elements of different sizes: 1 bool at the beginning, n unsigned ints and n unsigned shorts. I have done it in this way because having the least memory usage is very important for me. Unfortunately that leads to memory alignment issues when I try to access to copy from global memory into shared memory:

__global__ void sampleerror(unsigned char * global_mem, unsigned int updated_idx...) {
    __shared__ unsigned int offsets[MAX_NUM_CHILDREN/2 +1]; 
    __shared__ unsigned int entries[ENTRIES_PER_NODE];
    __shared__ bool booleans[4];
    bool * is_last = &booleans[0];
    //First warp divergence here. We are reading in from global memory
    if (i == 0) {
        *is_last = (bool)global_mem[updated_idx];
    }
    __syncthreads();

    if (*is_last) {
        //The number of entries in the bottom most nodes may be smaller than the size
        if (i < (size - 1)/entry_size) {
            entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + i*sizeof(unsigned int)]);
        }
    } else {
        int num_entries = (size - 1 - sizeof(unsigned int) - sizeof(unsigned short))/(entry_size + sizeof(unsigned short));
        //Load the unsigned int start offset together with the accumulated offsets to avoid warp divergence
        if (i < ((num_entries + 1)/2) + 1) {
            offsets[i] = *(unsigned int *)(&global_mem[updated_idx + 1 * i*sizeof(unsigned int)]);
        }
        __syncthreads();
        //Now load the entries
        if (i < num_entries) {
            entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + (num_entries + 1)*sizeof(unsigned int) + i*sizeof(unsigned int)]);
        }
    }
    __syncthreads();
}

I get misaligned memory access, because I am trying to copy to shared memory here (and in the else statement):

entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + i*sizeof(unsigned int)]);

because updated_idx + 1 is not necessarely aligned. Questions:

1) If I don't want to pad my datastructure to be nicely aligned for integers, is copying byte by byte my only option?

2) If I am copying byte by byte to shared memory from global, would it be 4 times slower than if I am able to copy unsigned int by unsigned int.

3) Is it possible to get misaligned memory accesses if I am doing it byte by byte? I think i've read that byte accesses are always aligned.

EDIT:

I have a btree-ish datastructure where each node contains a payload in the form of:

struct Entry {
    unsigned int key;
    unsigned int next_level_offset;
    float prob1;
    float prob2;
 }

For searching the btree I only require the key information from each entry, not the rest of the information in the struct. Therefore each node is collapsed in a byte array in the following way:

(bool is_last)(key1, key2, key3...)((offset, prob1 prob2 of key1), (offset, prob1 prob2 of key2), (offset, prob1 prob2 of key3))(unsigned int first_child_start_offset)(short sizeofChild1, short sizeofChild2, short sizeofChild3...)

Obviously if is_last is false, than there would be only no childrenOffsets stored.

The reason i have laid out the data in this way is that the number of entries per node can be variable so if I store separate things in separate arrays I would have to keep additional track of beginning and end indices of those "metadata" arrays which would result in either more data being stored or having to use a state machine during search, which I would like to avoid. I believe it can be done with relatively little work for the bool part of each node, but not for anything else (like the offsets).

Would it be too hard to post and actual compilable, minimal version of your code? It is pretty hard to analyse your kernel when a number of variables which influence the read patterns of the code are undefined. — talonmies
I could post a compilable version of my code but you require a byte array with the actual datastructure and the way to construct it, which i can't easily post. What is not clear from my code? — XapaJIaMnu
The actual size of the entries, for starters. Where does entry_size come from? Is updated_idx read from global memory or computed from block and thread indices. You haven't explained why it is even necessary to use an AOS anyway. Why not use an SOA approach? — talonmies
I have updated my post with the information you required. If it is not clear and you need more, please ask. — XapaJIaMnu

talonmies talonmies · Accepted Answer · 2015-07-20T10:30:34

If I don't want to pad my datastructure to be nicely aligned for integers, is copying byte by byte my only option?

Looking at what code you have provided, I would say more or less, yes. You might want to use memcpy. The compiler will emit pretty optimal byte copy loop by doing so. You might also want to investigate changed the ptxas default cache behaviour for loads to bypass L1 cache (so the -Xptxas="--def-load-cache=cg" option). It may give better performance.
If I am copying byte by byte to shared memory from global, would it be 4 times slower than if I am able to copy unsigned int by unsigned int.

You should expect memory throughput to be reduced. How much is hard to say without benchmarking. That's your job, if you are so inclined
Is it possible to get misaligned memory accesses if I am doing it byte by byte? I think i've read that byte accesses are always aligned.

The alignment criteria is always word size. So single byte words are always aligned. But keep in mind that if you perform bytes loads to a shared memory buffer and then try and use reinterpret_cast to read out larger words sizes which are not aligned to the shared byte array, you have the same problem.

You haven't given much detail about the size of a given subtree. There might be some template tricks you could use to expand a priori known size byte loads into a series of 32 bit char4 loads with 1 to 3 trailing byte loads to get a give byte buffer size into memory. That should be more performant, if it fits into your data structure design.

Cuda efficiently copy from byte array to shared memory elements of different sizes

1 Answers