I made some more improvements to Mike's good idea and code.
I also made a vectorized version (requiring SSE4.1). It's more likely to have bugs, but is worth trying because you should get a significant speedup from doing packed multiplies. Porting it to AVX should give another big speedup.
See all the code on godbolt, including a vectorized conversion from ASCII to 0..3 bases (using a pshufb LUT).
My changes:
- Don't translate ahead of time. The translation should overlap well with the FP loop's work, instead of forcing the FP work to wait for a tiny translation loop to finish before it can start.
- Simplify the counter variables (gcc makes better code: it was actually keeping j in a register rather than optimizing it away, or else it was fully unrolling the inner loop into a giant loop).
- Pull the scaling by (count + 4*alpha) out of the loop entirely: instead of dividing by it (or multiplying by the inverse), subtract its logarithm. Since log() grows extremely slowly, we can probably defer this indefinitely without losing too much precision in the final score (there's a small standalone sketch of the identity right after this list). An alternative would be to subtract only every N iterations, but then the loop would have to figure out whether it terminated early. At the very least, we could multiply by 1.0 / (count + 4*alpha) instead of dividing; without -ffast-math, the compiler can't do this for you.
- Have the caller calculate pwcluster for us: it probably calculates it for its own use anyway, and we can remove one of the function args (cluster).
- row_offset was making slightly worse code compared to just writing i*cols. If you like pointer increments as an alternative to array indexing, gcc makes even better code in the inner loop when incrementing pwcluster directly.
- Rename l to len: single-letter variable names are bad style except in very small scopes (like a loop, or a very small function that only does one thing), and even then only if there isn't a good short but meaningful name. E.g. p is no more meaningful than ptr, but len does tell you about what it means, not just what it is.
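To make the deferred-scaling point concrete, here's a minimal standalone sketch of the identity it relies on (the numbers are made up; c stands in for count + 4*alpha, and the x values for pwcluster[...] + alpha):

#include <cmath>
#include <cstdio>

// sum_i log(x_i / c)  ==  log( prod_i x_i ) - N * log(c),
// so we can accumulate raw products, call log() only occasionally,
// and subtract len * log(c) once at the very end.
int main() {
    const double c = 7.5;                      // stands in for (count + 4*alpha)
    const double x[] = {2.0, 3.5, 1.25, 4.0};  // stands in for (pwcluster[...] + alpha)
    double per_element = 0.0, product = 1.0;
    for (double xi : x) {
        per_element += std::log(xi / c);       // scale inside the loop (the original approach)
        product *= xi;                         // deferred: just multiply
    }
    double deferred = std::log(product) - 4 * std::log(c);
    std::printf("%.17g vs %.17g\n", per_element, deferred);  // agree to within rounding error
}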
Further observations:
- Storing sequences in translated format throughout your program would be better for this and any other code that wants to use the DNA base as an array index or counter.
- You can also vectorize translating nucleotide numbers (0..3) to/from ASCII using SSSE3 pshufb. (See my code on godbolt.)
- Storing your matrix in float instead of int might possibly be better. Since your code spends most of its time in this function now, it would run faster if it didn't have to keep converting from int to floating point. On Haswell, cvtss2sd (single->double) apparently has better throughput than cvtsi2sd (int->double), but not on Skylake. (ss2sd is slower on SKL than HSW.)
- Storing your matrix in double format might be faster, but the doubled cache footprint might be a killer. Doing this calculation with float instead of double would avoid the conversion cost, too. But you could defer log() for more iterations with double.
- Using multiple product variables (p1, p2, etc.) in a manually unrolled inner loop would expose more parallelism. Multiply them together at the end of the loop. (I ended up making a vectorized version with two vector accumulators; there's a scalar sketch of the same idea right after this list.)
- For Skylake, or maybe Broadwell, you could vectorize with VPGATHERDD. The vectorized translation from ASCII to 0..3 will be helpful here.
- Even without using a gather instruction, loading two integers into a vector and using a packed conversion instruction would be good. The packed conversion instructions are faster than the scalar ones. We have a lot of multiplies to do, and can certainly take advantage of doing two or four at once with SIMD vectors. See below.
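As a rough illustration of the multiple-accumulator idea in scalar code (a sketch only, reusing the variable names from the function below; not tested):

// Two independent products let consecutive multiplies overlap,
// instead of forming one long serial dependency chain.
double p1 = 1.0, p2 = 1.0;
for (; i + 1 < inner_bound; i += 2) {
    p1 *= pwcluster[(i+0)*cols + transTable[seq[i+0]]] + alpha;
    p2 *= pwcluster[(i+1)*cols + transTable[seq[i+1]]] + alpha;
}
for (; i < inner_bound; ++i)               // possible odd leftover element
    p1 *= pwcluster[i*cols + transTable[seq[i]]] + alpha;
double product = p1 * p2;                  // combine the accumulators at the end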
Simple version with my improvements:
See the full code on godbolt, link at the top of this answer.
double cal_score_simple(
    int len                            // one-letter variable names are only good in the smallest scopes, like a loop
  , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
  , double alpha )
{
    // note that __restrict__ isn't needed because we don't write into any pointers
    const int cols = 4;
    const int logdelay_factor = 4;  // accumulate products for this many iterations before doing a log()

    int count = 0;                  // count the first row of pwcluster
    for (int k = 0; k < cols; k++)
        count += pwcluster[k];
    const double log_c4a = log(count + 4*alpha);

    double score = 0;
    for (int i = 0; i < len; ) {
        double product = 1;
        int inner_bound = std::min(len, i + logdelay_factor);
        while (i < inner_bound) {
            unsigned int k = transTable[seq[i]];         // translate on the fly
            product *= (pwcluster[i*cols + k] + alpha);  // * count4alpha_inverse;  // scaling deferred
            // TODO: unroll this with two or four product accumulators to allow parallelism
            i++;
        }
        score += log(product);      // - log_c4a * j;
    }
    score -= log_c4a * len;         // might be ok to defer this subtraction indefinitely, since log() accumulates very slowly
    return score;
}
This compiles to quite good asm, with a pretty compact inner loop:
.L6:
movzx esi, BYTE PTR [rcx] # D.74129, MEM[base: _127, offset: 0B]
vxorpd xmm1, xmm1, xmm1 # D.74130
add rcx, 1 # ivtmp.44,
movzx esi, BYTE PTR transTable[rsi] # k, transTable
add esi, eax # D.74133, ivtmp.45
add eax, 4 # ivtmp.45,
vcvtsi2sd xmm1, xmm1, DWORD PTR [r12+rsi*4] # D.74130, D.74130, *_38
vaddsd xmm1, xmm1, xmm2 # D.74130, D.74130, alpha
vmulsd xmm0, xmm0, xmm1 # product, product, D.74130
cmp eax, r8d # ivtmp.45, D.74132
jne .L6 #,
Using a pointer increment instead of indexing with i*cols removes one add from the loop, getting it down to 10 fused-domain uops (vs. 11 in this loop). So it won't matter for frontend throughput from the loop buffer, but it does mean fewer uops for the execution ports. Resource stalls can make that matter, even when total uop throughput isn't the immediate bottleneck.
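For example, the inner loop of the simple version could be written with a pointer increment like this (a sketch with the same behavior as the indexed version):

const int *row = pwcluster + i*cols;   // point at the current row once
while (i < inner_bound) {
    unsigned int k = transTable[seq[i]];
    product *= row[k] + alpha;
    row += cols;                       // advance one 4-int row per element
    i++;
}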
Manually vectorized SSE version:
Not tested, and not that carefully written. I could easily have made some mistakes here. If you're running this on computers with AVX, you should definitely make an AVX version of this. Use vextractf128 as a first step in a horizontal product or sum, then do the same as what I have here.
With a vectorized log() function to compute two (or four with AVX) log() results in parallel in a vector, you could just do a horizontal sum at the end, instead of more frequent horizontal products before each scalar log(). I'm sure someone's written one, but I'm not going to take the time to search for it right now.
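For reference, the vextractf128 step for an AVX accumulator might look something like this (hproduct_pd256 is just a name I made up here; untested sketch):

#include <immintrin.h>

// Horizontal product of a __m256d: extract the high half, then multiply down to one double.
static inline double hproduct_pd256(__m256d v) {
    __m128d lo   = _mm256_castpd256_pd128(v);    // low half (free, no instruction)
    __m128d hi   = _mm256_extractf128_pd(v, 1);  // vextractf128
    __m128d p    = _mm_mul_pd(lo, hi);           // two pairwise products
    __m128d high = _mm_unpackhi_pd(p, p);        // broadcast the high element
    return _mm_cvtsd_f64(_mm_mul_sd(p, high));   // final scalar product
}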
// TODO: AVX version
#define likely(x)  __builtin_expect(!!(x), 1)   // branch-prediction hint (GNU C __builtin_expect)

double cal_score_SSE(
    int len                            // one-letter variable names are only good in the smallest scopes, like a loop
  , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
  , double alpha
  )
{
    const int cols = 4;
    const int logdelay_factor = 16; // accumulate products for this many iterations before doing a log()

    int count = 0;                  // count the first row of pwcluster
    for (int k = 0; k < cols; k++) count += pwcluster[k];
    //const double count4alpha_inverse = 1.0 / (count + 4*alpha);
    const double log_c4a = log(count + 4*alpha);

#define COUNTER_TYPE int

    //// HELPER FUNCTION: make a vector of two (pwcluster[i*cols + k] + alpha)
    auto lookup_two_doublevec = [&pwcluster, &seq, &alpha](COUNTER_TYPE pos) {
        unsigned int k0 = transTable[seq[pos]];
        unsigned int k1 = transTable[seq[pos+1]];
        __m128i pwvec = _mm_cvtsi32_si128( pwcluster[cols*pos + k0] );
        pwvec = _mm_insert_epi32(pwvec, pwcluster[cols*(pos+1) + k1], 1);
        // for AVX: repeat the previous lines, _mm_unpacklo_epi32 into one __m128i,
        // then use _mm256_cvtepi32_pd (__m128i src)
        __m128d alphavec = _mm_set1_pd(alpha);
        return _mm_cvtepi32_pd(pwvec) + alphavec;
        //p1d = _mm_add_pd(p1d, _mm_set1_pd(alpha));
    };

    double score = 0;
    for (COUNTER_TYPE i = 0; i < len; ) {
        double product = 1;
        COUNTER_TYPE inner_bound = i + logdelay_factor;
        if (inner_bound >= len) inner_bound = len;

        // possibly do a whole vector of transTable translations; probably doesn't matter
        if (likely(inner_bound < len)) {
            // We can do 8 or 16 elements without checking the loop counter
            __m128d p1d = lookup_two_doublevec(i+0);
            __m128d p2d = lookup_two_doublevec(i+2);

            i += 4;  // start with four elements loaded into two vectors, not multiplied by anything
            static_assert(logdelay_factor % 4 == 0, "logdelay_factor must be a multiple of 4 for vectorization");

            while (i < inner_bound) {
                // The *= syntax requires GNU C vector extensions, which is how __m128d is defined in gcc
                p1d *= lookup_two_doublevec(i+0);
                p2d *= lookup_two_doublevec(i+2);
                i += 4;
            }

            // we have two vector accumulators, holding two products each
            p1d *= p2d;  // combine to one vector
            //p2d = _mm_permute_pd(p1d, 1);  // if you have AVX. It's no better than movhlps, though.
            // movhlps p2d, p1d   // extract the high double, using p2d as a temporary
            p2d = _mm_castps_pd( _mm_movehl_ps(_mm_castpd_ps(p2d), _mm_castpd_ps(p1d) ) );
            p1d = _mm_mul_sd(p1d, p2d);  // multiply the last two elements, now that we have them extracted to separate vectors
            product = _mm_cvtsd_f64(p1d);
            // TODO: find a vectorized log() function for use here, and do a horizontal add down to a scalar outside the outer loop.
        } else {
            // Scalar for the last unknown number of iterations
            while (i < inner_bound) {
                unsigned int k = transTable[seq[i]];
                product *= (pwcluster[i*cols + k] + alpha);  // * count4alpha_inverse;  // scaling deferred
                i++;
            }
        }

        score += log(product);  // - log_c4a * j;  // deferred
    }

    score -= log_c4a * len;  // May be ok to defer this subtraction indefinitely, since log() accumulates very slowly
    // if not, subtract log_c4a * logdelay_factor in the vector part,
    // and (len & 15) * log_c4a out here at the end  (i.e. len % 16)
    return score;
}
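As a usage note for the "have the caller calculate pwcluster" change, a hypothetical call site might look like this (pw_matrix and the one-block-of-len-by-4-counts-per-cluster layout are my assumptions, not something from the original code; adjust to the real storage):

// Hypothetical caller: pw_matrix and its layout are assumed, not taken from the original program.
const int *pwcluster = pw_matrix + (size_t)cluster * len * 4;
double score = cal_score_SSE(len, seq, pwcluster, alpha);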
Vectorized ASCII->integer DNA base
Ideally, do the translation once when you read in sequences, and internally store them in 0/1/2/3 arrays instead of A/C/G/T ASCII strings.
It can be manually vectorized with pshufb, if we don't have to check for errors (invalid characters). In Mike's code, where we translate the whole input ahead of the FP loop, this can give a big speedup to that part of the code.
For translating on the fly, we could use a vector to:
- translate a block of 16 input characters in the outer loop,
- store it to a 16 byte buffer,
- then do scalar loads from that in the inner loop.
Since gcc seems to fully unroll the vector loop, this would replace 16 movzx instructions with 6 vector instructions (including the load and store). There's a sketch of this usage after the translation function below.
#include <immintrin.h>

__m128i nucleotide_ASCII_to_number(__m128i input) {
    // map A->0, C->1, G->2, T->3 (and the same for lower case).
    // Right-shifting by one and masking gives a 4-bit index that's distinct for the four bases:
    /* 'A' = 65 = 0b100 0001    >>1 : 0b10 0000    & 0xF =  0
     * 'C' = 67 = 0b100 0011    >>1 : 0b10 0001    & 0xF =  1
     * 'G' = 71 = 0b100 0111    >>1 : 0b10 0011    & 0xF =  3
     * 'T' = 84 = 0b101 0100    >>1 : 0b10 1010    & 0xF = 10  // same low 4 bits for lower-case
     *
     * We right-shift by one, mask, and use that as indices into a LUT
     * We can use pshufb as a 4bit LUT, to map all 16 chars in parallel
     */
    __m128i LUT = _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff,    3, 0xff, 0xff,
                               0xff, 0xff, 0xff, 0xff,    2, 0xff,    1,    0);
    // Not all "bogus" characters map to 0xFF, but 0xFF in the output only happens on invalid input

    __m128i shifted = _mm_srli_epi32(input, 1);  // there is no psrlb, so shift 32-bit elements and then mask
    __m128i masked  = _mm_and_si128(shifted, _mm_set1_epi8(0x0F));
    __m128i nucleotide_codes = _mm_shuffle_epi8(LUT, masked);
    return nucleotide_codes;
}
// compiles to:
vmovdqa xmm1, XMMWORD PTR .LC2[rip] # the lookup table
vpsrld xmm0, xmm0, 1 # tmp96, input,
vpand xmm0, xmm0, XMMWORD PTR .LC1[rip] # D.74111, tmp96,
vpshufb xmm0, xmm1, xmm0 # tmp100, tmp101, D.74111
ret
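And a rough sketch of the on-the-fly buffering idea from above, using the function just shown (untested; note the 16-byte load needs the tail of seq to be padded or handled separately):

alignas(16) unsigned char translated[16];
for (int i = 0; i < len; i += 16) {
    // translate 16 ASCII bases at once (reads a full 16 bytes, so pad seq or special-case the tail)
    __m128i chunk = _mm_loadu_si128((const __m128i*)(seq + i));
    _mm_store_si128((__m128i*)translated, nucleotide_ASCII_to_number(chunk));

    int block = (len - i < 16) ? (len - i) : 16;
    for (int j = 0; j < block; j++) {
        unsigned int k = translated[j];   // already 0..3: no transTable lookup in the FP loop
        // ... FP work here, e.g. product *= pwcluster[(i+j)*cols + k] + alpha;
    }
}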