2
votes

This is a follow-up to my previous question, Faster way to do multi dimensional matrix addition? After following @Peter Cordes's advice there I vectorized my code and the speed went up by 50x. I then ran gprof again and found that this function is taking most of the time:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 69.97      1.53     1.53                             cal_score(int, std::string, int const*, int, double)

double cal_score(int l, string seq, const int *__restrict__ pw,int cluster,double alpha)
{
  const int cols =4;
  const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
  double score = 0;
  char s;
  string alphabet="ACGT";   
  int count=0;
  for(int k=0;k<cols;k++)           
    count=count+pwcluster[k];

  for (int i = 0; i < l; i++){
    long row_offset = cols*i;
    s=seq[i];
    //#pragma omp simd 
    for(int k=0;k<cols;k++) {
            if (s==alphabet[k])
                score=score+log(    ( pwcluster[row_offset+k]+alpha )/(count+4*alpha)       );
    }
  }
  return score;
}

This is my first time doing code optimization, so I don't know how to proceed. Is there any way to write this function better so I can get some more speed? The input seq is a string over the characters 'ACGT' of length l. pw is a one-dimensional array of size 2*l*4, i.e. a flattened [p][q][r] array, and cluster is p.
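
Just to spell out the layout, here is a hypothetical helper (not code from my program, only an illustration of the indexing described above):

// pw is conceptually [2][l][4]: cluster-major, then position, then base
inline long pw_index(int p, int q, int r, int l) {
  return ((long)p * l + q) * 4 + r;   // matches pwcluster = pw + (long)cluster*l*4, then pwcluster[4*i + k]
}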

2
Since alpha is a double, everything gets promoted to double, and log is usually slow. You could try keeping score as a product and doing only one log at the very end; that saves you a lot of log calls. – user3528438
Probably won't help much, but you can tweak the line of code that runs the most, which is the score=score+ ... in the innermost loop. The expression (count+4*alpha) can be pre-computed. Also, I'm not sure why row_offset is a long -- is pwcluster bigger than 2GB? Can you even index that high? You might save a tiny bit of time by eliminating that mixed-mode arithmetic. – nicomp
By all means, try to minimize calls to log. You probably can't just do it once at the end, but you might do it every 10 iterations. The division by count+4*alpha can also be done outside the loop (by subtracting its log). Also, you can pre-process seq from the letters A..T into integers 0..3; then your inner loop doesn't have to be a loop at all! – Mike Dunlavey
Thanks a lot @MikeDunlavey, it's working. – ankit agrawal
50x speedup, nice! The original code had an impressive amount of things making it slow, but still. I used to work for the phylogenetics group at Dalhousie University (in Canada); maybe you know some people there, since you're also doing maximum-likelihood stuff on DNA sequences. – Peter Cordes

2 Answers

1
votes

Here's another way to rewrite it. This translates the string with a lookup table instead of a linear search, and calls log a factor of 10 fewer times.

This also changes seq to a const unsigned char* (a plain pointer), instead of a std::string passed by value (which would copy the whole string).

#include <cstring>     // memset
#include <algorithm>   // std::max
#include <cmath>       // log

unsigned char transTable[128];

void InitTransTable(){
  memset(transTable, 0, sizeof(transTable));
  transTable['A'] = 0;
  transTable['C'] = 1;
  transTable['G'] = 2;
  transTable['T'] = 3;
}

static int tslen = 0;                // static instead of global lets the compiler keep tseq in a register inside the loop
static unsigned char* tseq = NULL;   // reusable buffer for translations.  Not thread-safe

double cal_score(
    int l
  , const unsigned char* seq         // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pw
  , int cluster
  , double alpha
  )
{
  int i, j, k;
  // make sure tseq is big enough
  if (tseq == NULL){
    tslen = std::max(4096, l+1024);
    tseq = new unsigned char[tslen];
    memset(tseq, 0, tslen);
  } else if (l > tslen-1){
    delete[] tseq;   // was allocated with new[]
    tslen = l + 4096;
    tseq = new unsigned char[tslen];
    memset(tseq, 0, tslen);
  }
  // translate seq into tseq
  // (decrementing i so the beginning of tseq will be hot in cache when we're done)
  for (i = l; --i >= 0;) tseq[i] = transTable[seq[i]];

  const int cols = 4;
  const int *__restrict__ pwcluster = pw + ((long)cluster) * l * cols;
  double score = 0;
  // count up pwcluster
  int count=0;
  for(k = 0; k < cols; k++) count += pwcluster[k];

  double count4alpha = (count + 4*alpha);
  long row_offset = 0;
  for (i = 0; i < l;){
    double product = 1;
    for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
      k = tseq[i];
      product *= (pwcluster[row_offset + k] + alpha) / count4alpha;
    }
    score += log(product);
  }
  return score;
}
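
A hypothetical caller, just to show the types (pw, cluster and alpha as in the question; remember to fill the lookup table once first):

  InitTransTable();                        // fill the translation table once, before any calls
  std::string seq = "ACGTACGT";            // example input
  double s = cal_score((int)seq.size(),
                       reinterpret_cast<const unsigned char*>(seq.data()),
                       pw, cluster, alpha);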

This compiles to fairly good code, but without -ffast-math the division can't be replaced by a multiplication.

It doesn't auto-vectorize, because we only load one of every four elements of pwcluster.
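
If you're willing to accept slightly different rounding, you can do that replacement by hand. Here's a rough, untested sketch of the same inner loop with the reciprocal hoisted out of the loop:

  double count4alpha_inv = 1.0 / (count + 4*alpha);   // hoisted: one divide total instead of one per element
  long row_offset = 0;
  for (i = 0; i < l;){
    double product = 1;
    for (j = 0; j < 10 && i < l; j++, i++, row_offset += cols){
      k = tseq[i];
      product *= (pwcluster[row_offset + k] + alpha) * count4alpha_inv;   // multiply instead of divide
    }
    score += log(product);
  }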

1
votes

I made some more improvements to Mike's good idea and code.

I also made a vectorized version (requiring SSE4.1). It's more likely to have bugs, but is worth trying because you should get a significant speedup from doing packed multiplies. Porting it to AVX should give another big speedup.

See all the code on godbolt, including a vectorized conversion from ASCII to 0..3 bases (using a pshufb LUT).


My changes:

  • Don't translate ahead of time. Translating on the fly overlaps well with the FP loop's work, instead of forcing the FP work to wait for a tiny translation loop to finish first.

  • Simplify the counter variables (gcc makes better code this way: it was actually keeping j in a register rather than optimizing it away, or else fully unrolling the inner loop into one giant loop).

  • Pull the scaling by (count + 4*alpha) out of the loop entirely: instead of dividing by it (or multiplying by its inverse), subtract its logarithm. Since log() grows extremely slowly, we can probably defer this indefinitely without losing too much precision in the final score. (The identity is written out just after this list.)

    An alternative would be to only subtract every N iterations, but then the loop would have to figure out whether it terminated early. At the very least, we could multiply by 1.0 / (count + 4*alpha) instead of dividing; without -ffast-math, the compiler can't make that transformation for you.

  • Have the caller calculate pwcluster for us: it probably calculates it for its own use anyway, and we can remove one of the function args (cluster).

  • row_offset was making slightly worse code compared to just writing i*cols. If you like pointer increments as an alternative to array indexing, gcc makes even better code in the inner loop incrementing pwcluster directly.

  • rename l to len: single-letter variable names are bad style except in very small scopes (like a loop, or a very small function that only does one thing), and even then only if there isn't a good short but meaningful name. e.g. p is no more meaningful than ptr, but len does tell you what it means, not just what it is.
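
To spell out the deferred-scaling identity mentioned above: with c = count + 4*alpha and x_i = pwcluster[i*cols + k_i] for the base k_i at position i,

    score = sum_i log( (x_i + alpha) / c )
          = sum_i log( x_i + alpha )  -  len * log(c)

so a single subtraction of len * log(c) at the end replaces a divide (or a multiply by 1/c) on every element.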


Further observations:

  • Storing sequences in translated format throughout your program would be better for this and any other code that wants to use the DNA base as an array index or counter.

    You can also vectorize translating nucleotide numbers (0..3) to/from ASCII using SSSE3 pshufb. (See my code on godbolt).

  • Storing your matrix in float instead of int might possibly be better. Since your code spends most of its time in this function now, it would run faster if it didn't have to keep converting from int to floating point. On Haswell, cvtss2sd (single->double) apparently has better throughput than cvtsi2sd (int->double), but not on Skylake. (ss2sd is slower on SKL than HSW.)

    Storing your matrix in double format might be faster, but the doubled cache footprint might be killer. Doing this calculation with float instead of double would avoid the conversion cost, too. But you could defer log() for more iterations with double.

  • Using multiple product variables (p1, p2, etc.) in a manually unrolled inner loop would expose more instruction-level parallelism. Multiply them together at the end of the loop. (I ended up making a vectorized version with two vector accumulators; a scalar sketch of the same idea is just after this list.)

  • For Skylake or maybe Broadwell, you could vectorize with VPGATHERDD. The vectorized translation from ASCII to 0..3 will be helpful here.

  • Even without using a gather instruction, loading two integers into a vector and using a packed conversion instruction would be good. The packed conversion instructions are faster than the scalar ones. We have a lot of multiplies to do, and can certainly take advantage of doing two or four at once with SIMD vectors. See below.
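
For example, here's a rough, untested scalar sketch of that two-accumulator unrolling, applied to the inner while loop of cal_score_simple below (it assumes the block length is even; an odd tail would still need one ordinary scalar iteration):

    double p1 = 1, p2 = 1;
    while (i < inner_bound) {                        // two elements per iteration
      unsigned int k0 = transTable[seq[i]];
      unsigned int k1 = transTable[seq[i+1]];
      p1 *= (pwcluster[ i   *cols + k0] + alpha);    // independent dependency chains
      p2 *= (pwcluster[(i+1)*cols + k1] + alpha);    // can overlap their multiplies
      i += 2;
    }
    score += log(p1 * p2);                           // combine accumulators before the (deferred) log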


Simple version with my improvements:

See the full code on godbolt, link at the top of this answer.

double cal_score_simple(
    int len                            // one-letter variable names are only good in the smallest scopes, like a loop
  , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
  , double alpha )
{
  // note that __restrict__ isn't needed because we don't write into any pointers
  const int cols = 4;
  const int logdelay_factor = 4;  // accumulate products for this many iterations before doing a log()

  int count=0;    // count the first row of pwcluster
  for(int k = 0; k < cols; k++)
    count += pwcluster[k];

  const double log_c4a = log(count + 4*alpha);

  double score = 0;
  for (int i = 0; i < len;){
    double product = 1;
    int inner_bound = std::min(len, i+logdelay_factor);

    while (i < inner_bound){
      unsigned int k = transTable[seq[i]];        // translate on the fly
      product *= (pwcluster[i*cols + k] + alpha); // * count4alpha_inverse; // scaling deferred
      // TODO: unroll this with two or four product accumulators to allow parallelism
      i++;
    }

    score += log(product);  // - log_c4a * j;
  }

  score -= log_c4a * len;   // might be ok to defer this subtraction indefinitely, since log() accumulates very slowly
  return score;
}

This compiles to quite good asm, with a pretty compact inner loop:

.L6:
    movzx   esi, BYTE PTR [rcx]   # D.74129, MEM[base: _127, offset: 0B]
    vxorpd  xmm1, xmm1, xmm1    # D.74130
    add     rcx, 1    # ivtmp.44,
    movzx   esi, BYTE PTR transTable[rsi] # k, transTable
    add     esi, eax  # D.74133, ivtmp.45
    add     eax, 4    # ivtmp.45,
    vcvtsi2sd       xmm1, xmm1, DWORD PTR [r12+rsi*4]     # D.74130, D.74130, *_38
    vaddsd  xmm1, xmm1, xmm2    # D.74130, D.74130, alpha
    vmulsd  xmm0, xmm0, xmm1    # product, product, D.74130
    cmp     eax, r8d  # ivtmp.45, D.74132
    jne     .L6       #,

Using a pointer increment instead of indexing with i*cols removes one add from the loop, getting it down to 10 fused-domain uops (vs. 11 in this loop). So it won't matter for frontend throughput from the loop buffer, but will be fewer uops for the execution ports. Resource stalls can make that matter, even when total uop throughput isn't the immediate bottleneck.
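
A rough, untested sketch of that pointer-increment variant (same variables as cal_score_simple; pwrow walks forward one row per element instead of recomputing i*cols):

  const int *__restrict__ pwrow = pwcluster;     // advances by cols ints per element, across all outer iterations
  double score = 0;
  for (int i = 0; i < len;){
    double product = 1;
    int inner_bound = std::min(len, i+logdelay_factor);
    while (i < inner_bound){
      unsigned int k = transTable[seq[i]];
      product *= (pwrow[k] + alpha);               // index only by k: no i*cols address math in the loop
      pwrow += cols;
      i++;
    }
    score += log(product);
  }
  score -= log_c4a * len;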


Manually vectorized SSE version:

Not tested, and not that carefully written. I could easily have made some mistakes here. If you're running this on computers with AVX, you should definitely make an AVX version of this. Use vextractf128 as a first step in a horizontal product or sum, then the same as what I have here.

With a vectorized log() function to compute two (or four with AVX) log() results in parallel in a vector, you could just do a horizontal sum at the end, instead of more frequent horizontal products before each scalar log(). I'm sure someone's written one, but I'm not going to take the time to search for it right now.

#include <immintrin.h>                          // SSE4.1 intrinsics: compile with -msse4.1 (or -march=native)
#define likely(x) __builtin_expect(!!(x), 1)    // branch hint (GCC/clang builtin), used below

// TODO: AVX version
double cal_score_SSE(
    int len                            // one-letter variable names are only good in the smallest scopes, like a loop
  , const unsigned char* seq           // if you want to pass a std::string, do it by const &, not by value
  , const int *__restrict__ pwcluster  // have the caller do the address math for us, since it probably already does it anyway
  , double alpha
  )
{
  const int cols = 4;
  const int logdelay_factor = 16;  // accumulate products for this many iterations before doing a log()

  int count=0;    // count the first row of pwcluster
  for(int k = 0; k < cols; k++) count += pwcluster[k];

  //const double count4alpha_inverse = 1.0 / (count + 4*alpha);
  const double log_c4a = log(count + 4*alpha);

#define COUNTER_TYPE int
  
  //// HELPER FUNCTION: make a vector of two (pwcluster[i*cols + k] + alpha)
  auto lookup_two_doublevec = [&pwcluster, &seq, &alpha](COUNTER_TYPE pos) {
        unsigned int k0 = transTable[seq[pos]];
        unsigned int k1 = transTable[seq[pos+1]];
        __m128i pwvec = _mm_cvtsi32_si128( pwcluster[cols*pos + k0] );
           pwvec = _mm_insert_epi32(pwvec, pwcluster[cols*(pos+1) + k1], 1);
        // for AVX: repeat the previous lines, and _mm_unpack_epi32 into one __m128i,
        // then use _mm256_cvtepi32_pd (__m128i src)

        __m128d alphavec = _mm_set1_pd(alpha);
        return _mm_cvtepi32_pd(pwvec) + alphavec;
        //p1d = _mm_add_pd(p1d, _mm_set1_pd(alpha));
  };
  
  double score = 0;
  for (COUNTER_TYPE i = 0; i < len;){
    double product = 1;
    COUNTER_TYPE inner_bound = i+logdelay_factor;
    if (inner_bound >= len) inner_bound = len;
    // possibly do a whole vector of transTable translations; probably doesn't matter

    if (likely(inner_bound < len)) {
      // We can do 8 or 16 elements without checking the loop counter
      __m128d p1d = lookup_two_doublevec(i+0);
      __m128d p2d = lookup_two_doublevec(i+2);

      i+=4;  // start with four element loaded into two vectors, not multiplied by anything
      static_assert(logdelay_factor % 4 == 0, "logdelay_factor must be a multiple of 4 for vectorization");

      while (i < inner_bound) {
        // The *= syntax requires GNU C vector extensions, which is how __m128d is defined in gcc
        p1d *= lookup_two_doublevec(i+0);
        p2d *= lookup_two_doublevec(i+2);
        i+=4;
      }
      // we have two vector accumulators, holding two products each
      p1d *= p2d;            // combine to one vector

      //p2d = _mm_permute_pd(p1d, 1);  // if you have AVX.  It's no better than movhlps, though.
      // movhlps  p2d, p1d   // extract the high double, using p2d as a temporary
      p2d = _mm_castps_pd( _mm_movehl_ps(_mm_castpd_ps(p2d), _mm_castpd_ps(p1d) ) );

      p1d = _mm_mul_sd(p1d, p2d);   // multiply the last two elements, now that we have them extracted to separate vectors
      product = _mm_cvtsd_f64(p1d);
      // TODO: find a vectorized log() function for use here, and do a horizontal add down to a scalar outside the outer loop.
    } else {
      // Scalar for the last unknown number of iterations
      while (i < inner_bound){
        unsigned int k = transTable[seq[i]];
        product *= (pwcluster[i*cols + k] + alpha); // * count4alpha_inverse; // scaling deferred
        i++;
      }
    }

    score += log(product);  // - log_c4a * j;  // deferred
  }

  score -= log_c4a * len;   // May be ok to defer this subtraction indefinitely, since log() accumulates very slowly
  // if not, subtract log_c4a * logdefer_factor in the vector part,
  // and (len&15)*log_c4a out here at the end.  (i.e. len %16)
  return score;
}

Vectorized ASCII->integer DNA base

Ideally, do the translation once when you read in sequences, and internally store them in 0/1/2/3 arrays instead of A/C/G/T ASCII strings.

It can be manually vectorized with pshufb, if we don't have to check for errors (invalid characters). In Mike's code, where we translate the whole input ahead of the FP loop, this can give a big speedup to that part of the code.

For translating on the fly, we could use a vector to:

  • translate a block of 16 input characters in the outer loop,
  • store it to a 16 byte buffer,
  • then do scalar loads from that in the inner loop.

Since gcc seems to fully unroll the vector loop, this would replace 16 movzx instructions with 6 vector instructions (including the load and store). A sketch of that buffering pattern follows the translation function below.

#include <immintrin.h>
__m128i nucleotide_ASCII_to_number(__m128i input) {
  
  // map A->0, C->1, G->2, T->3.
    // low 4 bits aren't unique     low 4 bits *are* unique
  /* 'A' = 65 = 0b100 0001    >>1 : 0b10 0000
   * 'C' = 67 = 0b100 0011    >>1 : 0b10 0001
   * 'G' = 71 = 0b100 0111    >>1 : 0b10 0011
   * 'T' = 84 = 0b101 0100    >>1 : 0b10 1010   // same low 4 bits for lower-case
   * 
   * We right-shift by one, mask, and use that as indices into a LUT
   * We can use pshufb as a 4bit LUT, to map all 16 chars in parallel
   */

  __m128i LUT = _mm_set_epi8(0xff, 0xff, 0xff, 0xff, 0xff,    3, 0xff, 0xff,
                             0xff, 0xff, 0xff, 0xff,    2, 0xff,    1,    0);
  // Not all "bogus" characters map to 0xFF, but 0xFF in the output only happens on invalid input
  
  __m128i shifted = _mm_srli_epi32(input, 1);   // And then mask, to emulate srli_epi8
  __m128i masked  = _mm_and_si128(shifted, _mm_set1_epi8(0x0F));
  __m128i nucleotide_codes = _mm_shuffle_epi8(LUT,  masked);
  return nucleotide_codes;
}

   // compiles to:
    vmovdqa xmm1, XMMWORD PTR .LC2[rip]       # the lookup table
    vpsrld  xmm0, xmm0, 1       # tmp96, input,
    vpand   xmm0, xmm0, XMMWORD PTR .LC1[rip]     # D.74111, tmp96,
    vpshufb xmm0, xmm1, xmm0  # tmp100, tmp101, D.74111
    ret
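
And here's a rough, untested sketch of the 16-byte-buffer idea from the list above: translate a block of 16 characters with the vector function, store it to a small buffer, then do cheap scalar byte loads from that buffer in the inner loop. (It assumes reading a full 16 bytes of seq at the last block is safe, e.g. the input is padded; otherwise the final partial block needs a scalar-translation fallback.)

#include <immintrin.h>
#include <algorithm>   // std::min
#include <cmath>       // log

// sketch only: combines the deferred-log scalar loop with the vector translation above
double cal_score_buffered_translate(int len, const unsigned char* seq,
                                    const int *__restrict__ pwcluster, double alpha)
{
  const int cols = 4;
  int count = 0;                        // count the first row of pwcluster, as before
  for (int k = 0; k < cols; k++) count += pwcluster[k];
  const double log_c4a = log(count + 4*alpha);

  double score = 0;
  alignas(16) unsigned char tbuf[16];   // one vector's worth of translated bases
  for (int i = 0; i < len; i += 16) {
    // translate one 16-byte block (assumes this load doesn't run off the end of seq)
    __m128i ascii = _mm_loadu_si128(reinterpret_cast<const __m128i*>(seq + i));
    _mm_store_si128(reinterpret_cast<__m128i*>(tbuf), nucleotide_ASCII_to_number(ascii));

    double product = 1;
    int block = std::min(16, len - i);
    for (int j = 0; j < block; j++) {
      unsigned int k = tbuf[j];                          // already 0..3
      product *= (pwcluster[(i+j)*cols + k] + alpha);    // scaling still deferred
    }
    score += log(product);                               // one log per 16 elements
  }
  return score - log_c4a * len;
}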