I'm developing a bioinformatics tool and I'm trying to use SIMD to boost its speed.
Given two char arrays of length 16, I need to rapidly count the number of indices at which the strings match. For example, the two following strings, "TTTTTTTTTTTTTTTT" and "AAAAGGGGTTTTCCCC", match from 9th through 12th positions ("TTTT"), and therefore the output should be 4.
As shown in the following function foo (which works fine but slow), I packed each characters in seq1 and seq2 into __m128i variables s1 and s2, and used _mm_cmpeq_epi8 to compare every position simultaneously. Then, using popcnt128 (from Fast counting the number of set bits in __m128i register by Marat Dukhan) to add up the number of matching bits.
float foo(char* seq1, char* seq2) {
__m128i s1, s2, ceq;
int match;
s1 = _mm_load_si128((__m128i*)(seq1));
s2 = _mm_load_si128((__m128i*)(seq2));
ceq = _mm_cmpeq_epi8(s1, s2);
match = (popcnt128(ceq)/8);
return match;
}
Although popcnt128 by Marat Dukhan is a lot faster than naïvely adding up every bit in __m128i, __popcnt128() is the slowest bottleneck in the function, taking up about 80% of the computational speed. So, I would like to come up with an alternative to popcnt128.
I tried to interpret __m128i ceq
as a string and to use it as a key for a precomputed look-up table that maps a string to the total number of bits. If char array were hashable, I could do something like
union{__m128i ceq; char c_arr[16];}
match = table[c_arr] // table = unordered map
If I try to do something similar for strings (i.e. union{__m128i ceq; string s;};
), I get the following error message "::()’ is implicitly deleted because the default definition would be ill-formed". When I tried other things, I ran into segmentation faults.
Is there any way I can tell the compiler to read __m128i as string so I can directly use __m128i as a key for unordered_map? I don't see why it shouldn't work because string is a contiguous array of chars, which can be naturally represented by __m128i. But I couldn't get it to work and unable to find any solution online.
popcnt128
function. Beside this accessingc_arr
the way you do produce an ill-formed program leading to an undefined behaviour in C++ (usememcpy
instead if you need that). – Jérôme Richardpopcnt128
does much more than you actually need. Just do__builtin_popcnt(_mm_movemask_epi8(cnt))
. – chtzpsadbw
against 0 (hsum of bytes on the compare result), but that needs a final hsum step of qword halves, so that's going to be worse if HW popcnt is available, but worth considering if you need baseline x86-64 with just SSE2. (Also you need to negate before psadbw, or scale the sum down by 1/255). – Peter Cordesstd::unordered_map<char[16]>
or something, not a flat array, so it could be possible; there are "only" 2^16 entries needed. But of course is horribly slow compared to movemask / popcnt. The amount of work required just to compute a hash function of a 16-byte char array is comparable to just getting the answer. :P – Peter Cordes