I have a byte array (unsigned char *) that represents a tree like datastructure in memory. Each node of the tree contains elements of different sizes: 1 bool at the beginning, n unsigned ints and n unsigned shorts. I have done it in this way because having the least memory usage is very important for me. Unfortunately that leads to memory alignment issues when I try to access to copy from global memory into shared memory:
__global__ void sampleerror(unsigned char * global_mem, unsigned int updated_idx...) {
__shared__ unsigned int offsets[MAX_NUM_CHILDREN/2 +1];
__shared__ unsigned int entries[ENTRIES_PER_NODE];
__shared__ bool booleans[4];
bool * is_last = &booleans[0];
//First warp divergence here. We are reading in from global memory
if (i == 0) {
*is_last = (bool)global_mem[updated_idx];
}
__syncthreads();
if (*is_last) {
//The number of entries in the bottom most nodes may be smaller than the size
if (i < (size - 1)/entry_size) {
entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + i*sizeof(unsigned int)]);
}
} else {
int num_entries = (size - 1 - sizeof(unsigned int) - sizeof(unsigned short))/(entry_size + sizeof(unsigned short));
//Load the unsigned int start offset together with the accumulated offsets to avoid warp divergence
if (i < ((num_entries + 1)/2) + 1) {
offsets[i] = *(unsigned int *)(&global_mem[updated_idx + 1 * i*sizeof(unsigned int)]);
}
__syncthreads();
//Now load the entries
if (i < num_entries) {
entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + (num_entries + 1)*sizeof(unsigned int) + i*sizeof(unsigned int)]);
}
}
__syncthreads();
}
I get misaligned memory access, because I am trying to copy to shared memory here (and in the else statement):
entries[i] = *(unsigned int *)(&global_mem[updated_idx + 1 + i*sizeof(unsigned int)]);
because updated_idx + 1 is not necessarely aligned. Questions:
1) If I don't want to pad my datastructure to be nicely aligned for integers, is copying byte by byte my only option?
2) If I am copying byte by byte to shared memory from global, would it be 4 times slower than if I am able to copy unsigned int by unsigned int.
3) Is it possible to get misaligned memory accesses if I am doing it byte by byte? I think i've read that byte accesses are always aligned.
EDIT:
I have a btree-ish datastructure where each node contains a payload in the form of:
struct Entry {
unsigned int key;
unsigned int next_level_offset;
float prob1;
float prob2;
}
For searching the btree I only require the key information from each entry, not the rest of the information in the struct. Therefore each node is collapsed in a byte array in the following way:
(bool is_last)(key1, key2, key3...)((offset, prob1 prob2 of key1), (offset, prob1 prob2 of key2), (offset, prob1 prob2 of key3))(unsigned int first_child_start_offset)(short sizeofChild1, short sizeofChild2, short sizeofChild3...)
Obviously if is_last is false, than there would be only no childrenOffsets stored.
The reason i have laid out the data in this way is that the number of entries per node can be variable so if I store separate things in separate arrays I would have to keep additional track of beginning and end indices of those "metadata" arrays which would result in either more data being stored or having to use a state machine during search, which I would like to avoid. I believe it can be done with relatively little work for the bool part of each node, but not for anything else (like the offsets).