1
votes

A simpler version of the original question which I asked but nobody answered it yet.

I have a huge input file (a representative sample of which is shown below as input):

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

I want to process it by following a rule (described below) so that I get an output like:

 > output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   1
chr2:300-700    1   1   0
chr1:250-450    1   1   1
chr2:200-500    1   1   0
chr2:600-1000   1   1   1
chr1:400-800    0   1   1
chr1:700-870    1   0   1
chr2:700-1400   0   1   1

Rule: Take every index (the first in this case is chr1:200-400) of the dataframe, see if it overlaps with any other value in the dataframe. If yes, write 1 below that column in which it exists, if not write 0.

For example, if we take 1st index of the input input[1,1] which is chr1:200-400. As it exists in column 1 we will write 1 below it. Now we will check if this range overlap with any other range which exists in any of the other columns in the input. This value overlaps only with the first value (chr1:250-450) of the second column (CT2), therefore, we write 1 below that as well. As there is no overlap with any of the values in CT3, we write 0 below CT3 in the output dataframe.

Here are the dput of input and output:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
1
So this is the exact copy of your other question?zx8754
Not the exact copy, there the rule was bit complicated, where I want 1 for only those overlaps which are more than 50% of the total range. Here I am asking only for the overlaps. If there is any overlap, assign 1.Newbie
1. convert wide-to-long, 2. parse the strings into 3 columns "chr, start, end". 3. split on CTs, 4. merge on overlap, example1, example2 etczx8754
you would need to define 'overlap' because I cannot understand the allocation of zeros and ones in your output - I would expect something elseRolandASc
@RolandASc Overlap between input[1,1] (200-400) and input[1,2] (250-400) is 150. Additinoally chr should be same, which is chr1 in this case. And there is no overlap of input[1,1] region with any of the input[,3] regions as for chr1 they start from 400. Does it make sense now?Newbie

1 Answers

3
votes

A possible solution using the data.table-package:

# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)

# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
                                      , by = variable][]

# create aggregation function; needed in the ast reshape step
f <- function(x) as.integer(length(x) > 0)

# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
   ][, dcast(.SD, value ~ i.variable, fun = f)]

which gives:

           value CT1 CT2 CT3
1:  chr1:200-400   1   1   0
2:  chr1:250-450   1   1   1
3:  chr1:400-800   0   1   1
4:  chr1:700-870   1   0   1
5:  chr1:800-970   1   0   1
6:  chr2:200-500   1   1   0
7:  chr2:300-700   1   1   0
8: chr2:600-1000   1   1   1
9: chr2:700-1400   0   1   1