data.table: get index per group || two data.table

Question

I have two data frames stored as data.tables. One holds grouped data, and for each group I would like to extract the index of its values from the second data.table. Sample data is below.

library(data.table)

snp_bygene<-data.table(V2=c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP11","SNP12","SNP13","SNP14","SNP15"),
GENE=c( rep("GENE1",5),rep("GENE2",5) ),START=c(rep(100,5),rep(200,5)),END=c(rep(190,5),rep(290,5)) )

snp_data<-data.table(V2=c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP11","SNP12","SNP13","SNP14","SNP15"),BP=c(101,102,105,110,125,201,202,205,210,225))

I would like to get the index of each V2 value in snp_bygene as matched against V2 in snp_data. Per gene, I would like to get the SNP positions.

setkey(snp_data, V2)
snp_bygene[snp_data]

How do I proceed?

The final output would look like:
finalindex_perGene<-list("GENE1"=c(1, 2, 3, 4, 5) , "GENE2" =c(6, 7, 8, 9, 10))

Edit 1: there is no GENE group in snp_data

Hi - updated my question. Thank you. No there is no group in snp_data — Death Metal

akrun · Accepted Answer · 2021-01-14T18:28:10

We could do a non-equi join on the 'START' and 'END' columns against 'BP', combined with an equi join on 'V2'. In j, return the row index .I together with the 'GENE' column as a list, then split the index by 'GENE'. Because no column name is specified, the .I column is named I by default; it could be made explicit with .(I = .I, GENE).

with(snp_bygene[snp_data, .(.I, GENE), on = .(V2, START <= BP, 
       END >= BP)], split(I, GENE))

Output:

#$GENE1
#[1] 1 2 3 4 5

#$GENE2
#[1]  6  7  8  9 10
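
As a sketch of the explicit-name variant mentioned above, the join and the split can be written as two steps. This assumes the sample tables from the question and that snp_data is left in its original row order (no setkey call, since keying on V2 would sort "SNP11" ahead of "SNP2" and change the row order of the join result). The name `joined` is only an illustrative intermediate.

```r
library(data.table)

# Sample data from the question
snp_bygene <- data.table(
  V2    = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
            "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  GENE  = rep(c("GENE1", "GENE2"), each = 5),
  START = rep(c(100, 200), each = 5),
  END   = rep(c(190, 290), each = 5)
)
snp_data <- data.table(
  V2 = c("SNP1", "SNP2", "SNP3", "SNP4", "SNP5",
         "SNP11", "SNP12", "SNP13", "SNP14", "SNP15"),
  BP = c(101, 102, 105, 110, 125, 201, 202, 205, 210, 225)
)

# Equi join on V2 plus non-equi conditions START <= BP <= END;
# the row index column is named explicitly as I
joined <- snp_bygene[snp_data, .(I = .I, GENE),
                     on = .(V2, START <= BP, END >= BP)]

# One integer vector of indices per gene
finalindex_perGene <- split(joined$I, joined$GENE)
finalindex_perGene
```

With the sample data this should give the same per-gene list as the one-liner; keeping the join result in its own object makes it easier to inspect before splitting.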
