A simpler version of the original question which I asked but nobody answered it yet.
I have a huge input file (a representative sample of which is shown below as input
):
> input
CT1 CT2 CT3
1 chr1:200-400 chr1:250-450 chr1:400-800
2 chr1:800-970 chr2:200-500 chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following a rule (described below) so that I get an output
like:
> output
CT1 CT2 CT3
chr1:200-400 1 1 0
chr1:800-970 1 0 1
chr2:300-700 1 1 0
chr1:250-450 1 1 1
chr2:200-500 1 1 0
chr2:600-1000 1 1 1
chr1:400-800 0 1 1
chr1:700-870 1 0 1
chr2:700-1400 0 1 1
Rule:
Take every index (the first in this case is chr1:200-400
) of the dataframe, see if it overlaps with any other value in the dataframe. If yes, write 1
below that column in which it exists, if not write 0
.
For example, if we take 1st index of the input input[1,1]
which is chr1:200-400
. As it exists in column 1 we will write 1 below it. Now we will check if this range overlap with any other range which exists in any of the other columns in the input
. This value overlaps only with the first value (chr1:250-450
) of the second column (CT2
), therefore, we write 1 below that as well. As there is no overlap with any of the values in CT3
, we write 0
below CT3
in the output dataframe.
Here are the dput of input
and output
:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
1
for only those overlaps which are more than 50% of the total range. Here I am asking only for the overlaps. If there is any overlap, assign 1. – Newbieinput[1,1]
(200-400) andinput[1,2]
(250-400) is 150. Additinoallychr
should be same, which ischr1
in this case. And there is no overlap ofinput[1,1]
region with any of theinput[,3]
regions as for chr1 they start from 400. Does it make sense now? – Newbie