I'm trying to write a stream compaction routine (take an array and get rid of the empty elements) with SIMD intrinsics. Each iteration of the loop processes 8 elements at a time (the SIMD width).
With SSE intrinsics I can do this fairly efficiently with _mm_shuffle_epi8(), which does a 16-entry table lookup (a gather, in parallel computing terminology). The shuffle indices are precomputed and looked up with a bit mask.
for (i = 0; i < n; i += 8)
{
    v8n_Data = _mm_load_si128((const __m128i*)&data[i]);
    // is_valid is a byte array; load 8 flag bytes and keep their 8 mask bits
    mask = _mm_movemask_epi8(_mm_loadl_epi64((const __m128i*)&is_valid[i])) & 0xff;
    v8n_Compacted = _mm_shuffle_epi8(v8n_Data, v16n_ShuffleIndices[mask]);
    _mm_storeu_si128((__m128i*)&compacted[count], v8n_Compacted);
    count += bitCount[mask];
}
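For completeness, the lookup tables are precomputed roughly along these lines (assuming 16-bit elements, so 8 per __m128i; InitCompactionTables and the exact table types are just how I've sketched it here):

#include <tmmintrin.h>
#include <stdint.h>

__m128i v16n_ShuffleIndices[256];   // byte shuffle for each 8-bit validity mask
uint8_t bitCount[256];              // popcount of each mask, to advance the store

void InitCompactionTables(void)
{
    for (int mask = 0; mask < 256; mask++)
    {
        uint8_t indices[16];
        int out = 0;
        for (int element = 0; element < 8; element++)
        {
            if (mask & (1 << element))
            {
                indices[2 * out]     = (uint8_t)(2 * element);     // low byte of element
                indices[2 * out + 1] = (uint8_t)(2 * element + 1); // high byte of element
                out++;
            }
        }
        // 0x80 makes _mm_shuffle_epi8() write zero into the unused lanes
        for (int lane = 2 * out; lane < 16; lane++)
            indices[lane] = 0x80;

        v16n_ShuffleIndices[mask] = _mm_loadu_si128((const __m128i*)indices);
        bitCount[mask] = (uint8_t)out;
    }
}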
My problem is that now I would like to implement this for AltiVec SIMD too (don't ask why - misguided business decision). AltiVec doesn't have an equivalent of _mm_movemask_epi8(), which is a critical ingredient. So I will need to find a way to either
emulate _mm_movemask_epi8() - seems expensive: several shifts and ORs (a rough sketch of one emulation is after the example below), or
directly generate the shuffle indices efficiently -
namely, index i will be the index of the ith valid element in the uncompacted data
element_valid:   0 0 1 0 1 0 0 1 0
gather_indices:  x x x x x x 6 4 1
scatter_indices: 3 3 2 2 1 1 1 0 0
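To make option 1 concrete, here is one way the emulation could go - this sketch replaces the shift/OR chain with a multiply-sum, but it still looks like a fair amount of work per iteration (movemask8_altivec is a made-up name; big-endian element order and 0/1 validity bytes are assumed):

#include <altivec.h>

// Sketch of an 8-bit movemask: multiply each 0/1 validity byte by its bit
// weight and sum the products across the vector.
static inline unsigned int movemask8_altivec(vector unsigned char valid)
{
    // bit weight for each of the first 8 bytes, zero for the unused upper half
    const vector unsigned char weights =
        { 1, 2, 4, 8, 16, 32, 64, 128, 0, 0, 0, 0, 0, 0, 0, 0 };

    // vmsumubm: each word gets the sum of 4 weighted bytes
    vector unsigned int partial = vec_msum(valid, weights, vec_splat_u32(0));

    // vsumsws: add the four partial words; the total lands in element 3
    vector signed int total = vec_sums((vector signed int)partial, vec_splat_s32(0));

    union { vector signed int v; signed int s[4]; } u;
    u.v = total;
    return (unsigned int)u.s[3];
}

The validity bytes still have to be loaded into a vector first (vec_ld wants 16-byte alignment), and if the flags are 0x00/0xFF rather than 0/1 an extra vec_and with vec_splat_u8(1) is needed, so it adds up.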
It's simple to do this serially, but I need it to be parallel (SIMD). It seems easy to generate scatter indices with a prefix sum, but since neither AltiVec nor SSE has a scatter instruction, I need gather indices instead. The gather indices are the inverse function of the scatter indices, but how can that inverse be obtained in parallel? I know that in the pioneering days of GPU programming, converting scatters to gathers was a common technique, but neither of the two methods described there seems practical.
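For reference, the scatter-index half does seem manageable - an untested sketch of the prefix sum (again assuming big-endian AltiVec and 0/1 validity bytes; scatter_indices is a made-up name):

#include <altivec.h>

// Exclusive prefix sum of the validity bytes = scatter index of each element
// (the number of valid elements at lower positions).
static inline vector unsigned char scatter_indices(vector unsigned char valid)
{
    const vector unsigned char zero = vec_splat_u8(0);
    vector unsigned char scan = valid;

    // Hillis-Steele inclusive scan: add copies shifted right by 1, 2, 4, 8 bytes
    scan = vec_add(scan, vec_sld(zero, scan, 15));
    scan = vec_add(scan, vec_sld(zero, scan, 14));
    scan = vec_add(scan, vec_sld(zero, scan, 12));
    scan = vec_add(scan, vec_sld(zero, scan, 8));

    // subtract the original flags to make the scan exclusive
    return vec_sub(scan, valid);
}

But that only gives me the scatter indices, which is exactly the part I can't use without a scatter instruction.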
Maybe not insisting that the compaction preserve the element order would allow a more efficient implementation? I'm willing to give that up.