
I have a set of genetic SNP data that looks like:

Founder1 Founder2 Founder3 Founder4 Founder5 Founder6 Founder7 Founder8 Sample1 Sample2 Sample3 Sample...
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T
A A A T T T T T A T A T   

The size of the matrix is 56 columns by 46482 rows. I need to first bin the matrix into chunks of 20 rows, then compare each of the first 8 columns (founders) to each of columns 9-56, and divide the number of matching letters/alleles in each chunk by the number of rows per chunk (20). Ultimately I need 48 matrices of 8 columns by 2325 rows (one row per bin; 46482/20 rounds up to 2325), which are essentially similarity matrices. I have tried to extract each pair separately with something like:

    length(cbind(odd[,9],odd[,1])[cbind(odd[,9],odd[,1])[,1]=="T" & cbind(odd[,9],odd[,1])[,2]=="T",]) / nrow(cbind(odd[,9],odd[,1]))

but this is nowhere near efficient, and I do not know of a faster way of applying the function to every 20 rows and across multiple pairs.
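For what it's worth, the pairwise comparison in the attempt above collapses to a single vectorized expression. A minimal sketch, using a tiny toy stand-in for `odd` (the real object is the asker's 56-column data frame, not shown here):

```r
# Toy stand-in for the real data: one founder column and one sample column
odd <- data.frame(F1 = c("A", "A", "T", "T"),
                  S1 = c("A", "T", "T", "A"))

# Fraction of rows where the two columns carry the same allele.
# Note this counts all matching rows (A/A and T/T alike), which is what the
# per-bin similarity needs, rather than only rows where both are "T".
mean(odd$S1 == odd$F1)  # 0.5 for this toy example
```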

In the example given above, if all 20 rows of a bin were identical to the rows shown, then the first row of the matrix for Sample1 would be:

1 1 1 0 0 0 0 0
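Spelled out, that row can be computed from one 20-row bin like this (a sketch; `bin` is a hypothetical data frame holding the 20 identical example rows, with the first 8 columns as founders and the 9th as Sample1):

```r
# Build one 20-row bin in which every row repeats the example row
bin <- as.data.frame(matrix(rep(c("A","A","A","T","T","T","T","T","A"),
                                each = 20),
                            nrow = 20))
colnames(bin) <- c(paste0("Founder", 1:8), "Sample1")

# Fraction of the 20 rows where each founder matches Sample1
sim_row <- sapply(1:8, function(f) mean(bin[[f]] == bin$Sample1))
sim_row  # 1 1 1 0 0 0 0 0
```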

1 Answer


I think this is what you want? It helps to break the problem down into smaller pieces and then repeatedly apply a function to those pieces. My solution takes a few minutes to run on my laptop, but I think it should give you or others a start. If you're looking for better speed, I'd recommend looking at the data.table package. I'm sure there are other ways to make the code below a little faster too.

# Make a data set of random sequences
rows = 46482
cols = 56
binsize = 20
founder.cols = 1:8
sample.cols = setdiff(1:cols,founder.cols)
data = as.data.frame( matrix( sample( c("A","C","T","G"), 
                                      rows * cols, replace=TRUE ), 
                              ncol=cols ) )

# Split the data into bins
binlevels = gl(n=ceiling(rows/binsize),k=binsize,length=rows)
databins = split(data,binlevels)

# A function for making a similarity matrix
compare_cols = function(i,j,mat) mean(mat[,i] == mat[,j])
compare_group_cols = function(mat, group1.cols, group2.cols) {
  outer( X=group1.cols, Y=group2.cols, 
        Vectorize( function(X,Y) compare_cols(X,Y,mat) ) )
}

# Apply the function to each bin: one 48-by-8 (samples x founders) matrix per bin
mats = lapply( databins, compare_group_cols, sample.cols, founder.cols )

# And just to check. Random sequences should match 25% of the time. Right?
hist( vapply(mats, mean, numeric(1)), breaks=30 ) # looks like this is the case
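If the lapply/outer version above is too slow, one way to speed it up without data.table is to vectorize the whole computation in base R: compare each founder/sample pair once over all rows and let rowsum() aggregate the 0/1 match indicator by bin. This is a sketch under the same random setup as above, not a benchmarked drop-in:

```r
# Same random setup as in the answer above
rows = 46482; cols = 56; binsize = 20
founder.cols = 1:8
sample.cols = setdiff(1:cols, founder.cols)
data = as.data.frame(matrix(sample(c("A","C","T","G"), rows * cols,
                                   replace = TRUE), ncol = cols))
binlevels = gl(n = ceiling(rows / binsize), k = binsize, length = rows)

# Rows per bin (the last bin is short because 46482 %% 20 != 0)
bin.sizes = tabulate(binlevels)

# Per-bin match fraction for one sample column s and one founder column f:
# rowsum() sums the 0/1 match indicator within each bin in a single pass
sim_pair = function(s, f)
  as.vector(rowsum((data[[s]] == data[[f]]) + 0, binlevels)) / bin.sizes

# One nbins-by-8 matrix per sample (rows = bins, columns = founders),
# i.e. the 48 eight-column similarity matrices asked for in the question
sims = lapply(sample.cols, function(s) sapply(founder.cols, sim_pair, s = s))
```

This trades the per-bin loop for 48 * 8 full-length vector comparisons, which R handles quickly even at 46482 rows.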