I have two data frames in data.table form. One holds grouped data, and for each group I would like to extract the indices of its values in the second data.table. Sample data are below.

library(data.table)

snp_bygene <- data.table(V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  GENE = c(rep("GENE1", 5), rep("GENE2", 5)), START = c(rep(100, 5), rep(200, 5)), END = c(rep(190, 5), rep(290, 5)))

snp_data <- data.table(V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"), BP = c(101, 102, 105, 110, 125, 201, 202, 205, 210, 225))

I would like to get the index of each V2 value of snp_bygene as matched against V2 in snp_data. That is, for each gene I would like the positions of its SNPs.

setkey(snp_data, V2)
snp_bygene[snp_data]

How do I proceed?

The final output would look like:
finalindex_perGene<-list("GENE1"=c(1, 2, 3, 4, 5) , "GENE2" =c(6, 7, 8, 9, 10))

Edit 1: there is no GENE group in snp_data

2
Hi - I have updated my question. Thank you. No, there is no group in snp_data - Death Metal

2 Answers


We could do a non-equi join of the 'START' and 'END' columns against 'BP', combined with an equi join on 'V2', get the row index with .I, and return it together with the 'GENE' column. We then split that index by 'GENE'. The index column is named I by default because no name was given for .I; the name could be made explicit with .(I = .I, GENE).

with(snp_bygene[snp_data, .(.I, GENE), on = .(V2, START <= BP, 
       END >= BP)], split(I, GENE))

-output

#$GENE1
#[1] 1 2 3 4 5

#$GENE2
#[1]  6  7  8  9 10
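
For reference, the variant with the explicitly named index column can be sketched as below. This is only a self-contained restatement of the join above on the question's sample data; the intermediate name joined and the result name index_per_gene are made up for illustration.

```r
library(data.table)

# Sample data from the question
snp_bygene <- data.table(
  V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
         "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  GENE = c(rep("GENE1", 5), rep("GENE2", 5)),
  START = c(rep(100, 5), rep(200, 5)),
  END = c(rep(190, 5), rep(290, 5))
)
snp_data <- data.table(
  V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
         "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  BP = c(101, 102, 105, 110, 125, 201, 202, 205, 210, 225)
)

# Equi join on V2 plus non-equi join START <= BP <= END;
# the index column is named explicitly instead of relying on the default name
joined <- snp_bygene[snp_data, .(I = .I, GENE),
                     on = .(V2, START <= BP, END >= BP)]

# One integer vector of indices per gene (index_per_gene is a hypothetical name)
index_per_gene <- split(joined$I, joined$GENE)
index_per_gene
```

On the sample data this should reproduce the output shown above, i.e. 1 to 5 for GENE1 and 6 to 10 for GENE2.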

If BP is not involved in merging the data tables (i.e., the match depends on V2 only), we could use chmatch to obtain the matched row indices, e.g.,

> with(snp_bygene, split(chmatch(snp_bygene[, V2], snp_data[, V2]), GENE))
$GENE1
[1] 1 2 3 4 5

$GENE2
[1]  6  7  8  9 10

Otherwise, you may need the non-equi and equi joins, as @akrun did.
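
One caveat worth checking with the chmatch approach: the question runs setkey(snp_data, V2), and setkey sorts the table by the key, so positions in a keyed snp_data differ from positions in the original row order. A minimal sketch of the effect, assuming the question's sample data (the names ids, keyed, pos_unkeyed and pos_keyed are made up for illustration):

```r
library(data.table)

# Sample snp_data from the question, in its original row order
snp_data <- data.table(
  V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
         "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  BP = c(101, 102, 105, 110, 125, 201, 202, 205, 210, 225)
)
ids <- snp_data$V2

# Positions in the unkeyed table: simply 1..10
pos_unkeyed <- chmatch(ids, snp_data$V2)

# setkey() sorts by V2 (so "SNP11" comes before "SNP2");
# work on a copy so the original table keeps its order
keyed <- copy(snp_data)
setkey(keyed, V2)
pos_keyed <- chmatch(ids, keyed$V2)

pos_unkeyed  # 1 2 3 4 5 6 7 8 9 10
pos_keyed    # 1 7 8 9 10 2 3 4 5 6
```

So the 1 to 5 and 6 to 10 result shown above corresponds to an snp_data that has not been keyed on V2.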