I am new to R and am trying to accomplish the following task efficiently.

I have a data.frame, x, with columns: start, end, val1, val2, val3, val4. The rows are sorted by start.

For each value of start, I first have to find all the rows in x that share it. Because the rows are sorted, they will be consecutive. If a particular start occurs only once, I ignore it. Otherwise I work on the group of rows with that start. For example, suppose one particular start has 3 entries, as shown below:

entries for start=10

start end val1 val2 val3 val4
   10  25    8    9    0    0
   10  55   15  200    4    9
   10  30    4    8    0    1

Then I have to take the rows 2 at a time and run fisher.test on the 2x4 matrix formed from val1 to val4 of each pair. That is,

row1:row2 => fisher.test(matrix(c(8,15,9,200,0,4,0,9), nrow=2))
row1:row3 => fisher.test(matrix(c(8,4,9,8,0,0,0,1), nrow=2))
row2:row3 => fisher.test(matrix(c(15,4,200,8,4,0,9,1), nrow=2))
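
As a minimal sketch of the pairwise step, assuming the three rows above are stored in a data frame (called `d` here purely for illustration), the 2x4 matrix for a pair can be taken directly from the val columns instead of being typed out:

```r
# Hypothetical data frame `d` holding the three example rows for start = 10
d <- data.frame(start = c(10, 10, 10), end = c(25, 55, 30),
                val1 = c(8, 15, 4), val2 = c(9, 200, 8),
                val3 = c(0, 4, 0),  val4 = c(0, 9, 1))

val_cols <- c("val1", "val2", "val3", "val4")

# Rows 1 and 2 as a 2x4 matrix: one row per entry, one column per val
m12 <- as.matrix(d[c(1, 2), val_cols])

# The same matrix built column-wise, as written in the question
m12_manual <- matrix(c(8, 15, 9, 200, 0, 4, 0, 9), nrow = 2)

# p-value of the Fisher exact test for this pair
p12 <- fisher.test(m12)$p.value
```

Both constructions give the same counts in the same cells, so either can be passed to fisher.test.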

The code I wrote does this with nested for-loops. I was wondering whether it could be vectorized or otherwise improved.

f_start = as.factor(x$start) #convert start to factor to get count
tab_f_start = table(f_start) # tabulate the factor to get the count per start
o_start1 = NULL
o_end1   = NULL
o_start2 = NULL
o_end2   = NULL
p_val    = NULL
for (i in 1:length(tab_f_start)) {
    # check if there is more than 1 entry with the same start
    if ( tab_f_start[i] > 1) {
        # get all rows for current start
        cur_entry = x[x$start == as.integer(names(tab_f_start[i])),]
        # loop over all combinations to obtain p-values
        ctr = tab_f_start[i]
        for (j in 1:(ctr-1)) {
            for (k in (j+1):ctr) {
                # store start and end values separately
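                # index into cur_entry (the rows for this start), not the full x
                o_start1 = c(o_start1, cur_entry$start[j])
                o_end1   = c(o_end1, cur_entry$end[j])
                o_start2 = c(o_start2, cur_entry$start[k])
                o_end2   = c(o_end2, cur_entry$end[k])
                # construct matrix
                m1 = c(cur_entry$val1[j], cur_entry$val1[k])
                m2 = c(cur_entry$val2[j], cur_entry$val2[k])
                m3 = c(cur_entry$val3[j], cur_entry$val3[k])
                m4 = c(cur_entry$val4[j], cur_entry$val4[k])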
                m = matrix(c(m1,m2,m3,m4), nrow=2)
                # fisher.test returns a list; keep only the p-value
                p_val = c(p_val, fisher.test(m)$p.value)
            }
        }
    }
}
result=data.frame(o_start1, o_end1, o_start2, o_end2, p_val)

Thank you!

Vectorization is a good idea, and you should check out the plyr package for approaches to this problem. However, it is very likely that the bottleneck in your code is the Fisher exact test evaluation, so you are likely to end up with more compact code but not much faster code. (I'd be happy to be proved wrong.) – Ben Bolker

1 Answer

As @Ben Bolker suggested, you can use the plyr package to do this compactly. The first step is to create a wider data-frame that contains the desired row-pairs. The row-pairs are generated using the combn function:

set.seed(1)
x <- data.frame( start = c(1,2,2,2,3,3,3,3), 
                 end = 1:8, 
                 v1 = sample(8), v2 = sample(8), v3 = sample(8), v4 = sample(8))

library(plyr)
z <- ddply(x, .(start), function(d) if (nrow(d) == 1) NULL
                                      else {
                                        row_pairs <- combn(nrow(d),2)
                                        cbind( a = d[ row_pairs[1,], ],
                                               b = d[ row_pairs[2,], ] )
                                      })[, -1]
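
To make the row-pairing concrete, here is a small sketch of what combn returns for a group of 3 rows. The variable names `first` and `second` are only for illustration:

```r
# All unordered pairs of row indices within a group of 3 rows
row_pairs <- combn(3, 2)

# Each column of row_pairs is one pair: (1,2), (1,3), (2,3)
first  <- row_pairs[1, ]  # index of the first row in each pair:  1 1 2
second <- row_pairs[2, ]  # index of the second row in each pair: 2 3 3
```

Indexing the group with `first` and `second` and binding the two results side by side is what produces the a.* and b.* columns of z.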

The second step is to extract the p.value from applying the fisher.test to each row-pair:

result <- ddply(z, .(a.start, a.end, b.start, b.end), 
                function(d) 
                     fisher.test(matrix(unlist( d[, -c(1,2,7,8) ]), 
                                        nrow=2, byrow=TRUE))$p.value  )


> result
  a.start a.end b.start b.end         V1
1       2     2       2     3 0.33320784
2       2     2       2     4 0.03346192
3       2     3       2     4 0.84192284
4       3     5       3     6 0.05175017
5       3     5       3     7 0.65218289
6       3     5       3     8 0.75374989
7       3     6       3     7 0.34747011
8       3     6       3     8 0.10233072
9       3     7       3     8 0.52343422
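
For comparison, the same pairwise tests can be sketched in base R without plyr. This is an alternative, not the approach used in the answer above. The helper name `pairwise_fisher` and the output column name `p.value` are hypothetical, and the sketch assumes the columns start, end, v1 to v4 of the example data:

```r
# Hypothetical helper: Fisher tests for every pair of rows sharing a start
pairwise_fisher <- function(x, val_cols = c("v1", "v2", "v3", "v4")) {
  groups <- split(x, x$start)
  # Keep only starts that occur more than once
  groups <- Filter(function(d) nrow(d) > 1, groups)
  if (length(groups) == 0) return(data.frame())
  res <- lapply(groups, function(d) {
    pairs <- combn(nrow(d), 2)          # one column per row pair
    vals  <- as.matrix(d[, val_cols])   # counts only
    # 2x4 matrix per pair: one row per entry, one column per val
    p <- apply(pairs, 2, function(ij) fisher.test(vals[ij, ])$p.value)
    data.frame(a.start = d$start[pairs[1, ]], a.end = d$end[pairs[1, ]],
               b.start = d$start[pairs[2, ]], b.end = d$end[pairs[2, ]],
               p.value = p)
  })
  out <- do.call(rbind, res)
  rownames(out) <- NULL
  out
}

# Same example data as in the answer
set.seed(1)
x <- data.frame(start = c(1, 2, 2, 2, 3, 3, 3, 3),
                end = 1:8,
                v1 = sample(8), v2 = sample(8), v3 = sample(8), v4 = sample(8))

res <- pairwise_fisher(x)
```

As noted in the comment on the question, most of the time is likely spent inside fisher.test itself, so this is expected to be more compact than the loops rather than much faster.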