I am new to R and am trying to accomplish the following task efficiently.

I have a data.frame, x, with columns start, end, val1, val2, val3 and val4. The rows are sorted by start.

For each start, I first have to find all the entries in x that share the same start. Because the data frame is ordered, they will be consecutive. If a particular start occurs only once, I ignore it. Let's say that for one particular start there are 3 entries, as shown below:

Entries for start=10:

    start end val1 val2 val3 val4
       10  25    8    9    0    0
       10  55   15  200    4    9
       10  30    4    8    0    1

Then, I have to take 2 rows at a time and perform a fisher.test on the 2x4 matrix formed from val1 to val4 of those two rows. That is,

    row1:row2 => fisher.test(matrix(c(8,15,9,200,0,4,0,9), nrow=2))
    row1:row3 => fisher.test(matrix(c(8,4,9,8,0,0,0,1), nrow=2))
    row2:row3 => fisher.test(matrix(c(15,4,200,8,4,0,9,1), nrow=2))

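
For illustration, the three pairings above can be generated rather than written out by hand. This is only a minimal sketch for the start=10 example: the names vals, pairs and p_vals are made up here, and it assumes the values are held in a plain numeric matrix with one row per entry.

```r
# The three example rows for start = 10 (val1..val4 only).
vals <- rbind(c(8, 9, 0, 0),
              c(15, 200, 4, 9),
              c(4, 8, 0, 1))

# All unordered pairs of row indices, one pair per column:
# (1,2), (1,3), (2,3).
pairs <- combn(nrow(vals), 2)

# Selecting two rows gives the same 2x4 matrix as the hand-written
# matrix(c(8,15,9,200,0,4,0,9), nrow=2) for the first pair, and so on.
p_vals <- apply(pairs, 2, function(idx) fisher.test(vals[idx, ])$p.value)
```

combn only enumerates the pairs; each fisher.test call is still evaluated separately, so this tidies the bookkeeping without reducing the number of tests.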
The code I wrote does this with traditional for-loops. I was wondering if it could be vectorized or improved in any way.

    f_start = as.factor(x$start)    # convert start to factor to get counts
    tab_f_start = table(f_start)    # tabulate to access the count per start
    o_start1 = NULL
    o_end1 = NULL
    o_start2 = NULL
    o_end2 = NULL
    p_val = NULL
    for (i in seq_along(tab_f_start)) {
        # check if there is more than 1 entry with the same start
        if (tab_f_start[i] > 1) {
            # get all rows for the current start
            cur_entry = x[x$start == as.integer(names(tab_f_start[i])), ]
            # loop over all pairs of rows to obtain p-values
            ctr = tab_f_start[i]
            for (j in 1:(ctr - 1)) {
                for (k in (j + 1):ctr) {
                    # store start and end values separately
                    # (index into cur_entry, not x, so j and k refer to this group)
                    o_start1 = c(o_start1, cur_entry$start[j])
                    o_end1 = c(o_end1, cur_entry$end[j])
                    o_start2 = c(o_start2, cur_entry$start[k])
                    o_end2 = c(o_end2, cur_entry$end[k])
                    # construct the 2x4 matrix (filled column by column)
                    m1 = c(cur_entry$val1[j], cur_entry$val1[k])
                    m2 = c(cur_entry$val2[j], cur_entry$val2[k])
                    m3 = c(cur_entry$val3[j], cur_entry$val3[k])
                    m4 = c(cur_entry$val4[j], cur_entry$val4[k])
                    m = matrix(c(m1, m2, m3, m4), nrow=2)
                    # keep only the p-value, not the whole test object
                    p_val = c(p_val, fisher.test(m)$p.value)
                }
            }
        }
    }
    result = data.frame(o_start1, o_end1, o_start2, o_end2, p_val)

Thank you!

plyr package for approaches to this problem. However ... it is very likely that the bottleneck in your code is the Fisher exact test evaluation, so you are likely to end up with more compact code but not much faster code. (I'd be happy to be proved wrong.) – Ben Bolker
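
To make the "more compact code" point concrete, here is a rough sketch of the split-apply-combine idea. It uses base R (split, lapply, combn) rather than plyr, purely to avoid an extra dependency. The helper name pairwise_fisher is hypothetical, and the fourth row of the example data (start = 20) is invented only to show that singleton starts are skipped.

```r
# Hypothetical helper (not from the original post): run all pairwise
# Fisher tests within each group of rows that share a start value.
pairwise_fisher <- function(x) {
  val_cols <- c("val1", "val2", "val3", "val4")
  groups <- split(x, x$start)
  # Keep only start values that occur more than once.
  groups <- groups[vapply(groups, nrow, integer(1)) > 1]
  per_group <- lapply(groups, function(g) {
    idx <- combn(nrow(g), 2)   # one column per pair of rows
    j <- idx[1, ]
    k <- idx[2, ]
    vals <- as.matrix(g[, val_cols])
    # One 2x4 Fisher test per pair; this is still the expensive step.
    p <- vapply(seq_len(ncol(idx)), function(n) {
      fisher.test(vals[c(j[n], k[n]), ])$p.value
    }, numeric(1))
    data.frame(o_start1 = g$start[j], o_end1 = g$end[j],
               o_start2 = g$start[k], o_end2 = g$end[k],
               p_val = p)
  })
  do.call(rbind, per_group)
}

# Example data: the three start = 10 rows from the question, plus one
# made-up singleton row (start = 20) that should be ignored.
x <- data.frame(start = c(10, 10, 10, 20),
                end   = c(25, 55, 30, 40),
                val1  = c(8, 15, 4, 1),
                val2  = c(9, 200, 8, 2),
                val3  = c(0, 4, 0, 3),
                val4  = c(0, 9, 1, 4))

result <- pairwise_fisher(x)
```

This avoids growing vectors with c() inside nested loops, but it makes exactly as many fisher.test calls as the loop version, so any speed-up is likely to be modest if the test itself dominates the run time.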