1
votes

I have a dataset of 3156 DNA sequences, each of which has 98290 characters (SNPs), comprising the (usual) 5 symbols : A, C, G, T, N (gap).

What is the optimal way to find the pairwise Hamming distance between these sequences?

Note that for each sequence, I actually want to find the reciprocal of the number of sequences (including itself), where the per-site hamming distance is less than some threshold (0.1 in this example).

So far, I have attempted the following:

library(doParallel)
registerDoParallel(cores=8)
result <- foreach(i = 1:3156) %dopar% {
 temp <- 1/sum(sapply(snpdat, function(x) sum(x != snpdat[[i]])/98290 < 0.1))
}

snpdat is a list variable where snpdat[[i]] contains the ith DNA sequence.

This takes around 36 minutes to run on a core i7 - 4790 with 16GB ram. I also tried using the stringdist package, which takes more time to generate the same result.

Any help is highly appreciated!

1
a quick google came up with this: johanndejong.wordpress.com/2015/09/23/… - GordonShumway
You might consider using specialized functions, such as dist.hamming in phangorn, or the Biostrings package on Bioconductor. - Axeman
Thanks for the advice. @GordonShumway, my dataset is unforutnately too large for matrix multiplication... - Sudaraka
Thanks for the advice. @Axeman, I will give it a shot now... - Sudaraka

1 Answers

1
votes

I am not sure if this is the most optimal solution, but I was able to bring the run time down to around 15 minutes using Rcpp. I'll write the code here in case someone might find it useful someday...

This is the C++ code (I have used Sugar operators here)...

#include <Rcpp.h>
using namespace Rcpp;

double test5(const List& C, const int& x){
  double HD;
  for(int i = 0; i < 3156; i++) if(sum(CharacterVector(C[x])!=CharacterVector(C[i])) < 9829) HD++;
  return HD;
}

After compiling:

library(Rcpp)
sourceCpp("hd_code.cpp")

I simply call this function from R:

library(foreach)
library(doParallel)
registerDoParallel(cores = 8)
t =Sys.time()
bla = foreach(i = 1:3156, .combine = "c") %dopar% test5(snpdat,i-1)
Sys.time() - t

Can anyone think of an even quicker way to do this?