1
votes

How can compare two data.frame (df1 and df2) and extract the rows with common gene names

df1 =

logp    chr start   end CNA Genes   No.of.genes
25.714.697  1   90100868    90212160    gain    Iqca,Ackr3  2
2.213.423   1   175422136   176019087   loss    Rgs7,Fh1,Kmo,Opn3,Chml,Wdr64,Gm25560,Exo1,Gm23805,Pld5,B020018G12Rik    11
5.607.005   2   145619035   147312698   gain    Slc24a3,Rin2,Naa20,Crnkl1,4930529M08Rik,Insm1,Ralgapa2,Xrn2,Nkx2-4,Nkx2-2,Gm22261   11
3.756.075   2   141246149   141653989   loss    Macrod2 1
4.852.608   2   41586450    41739605    loss    Lrp1b   1
590.684 2   86729423    86860061    loss    Olfr1089,Olfr1090,Olfr1093,Olfr1093,Olfr141,Olfr1094,Olfr1094,Olfr1095  8
5.721.239   3   25408115    25519319    gain    Nlgn1   1
4.295.527   3   92005564    92134972    gain    Pglyrp3,Prr9    2
4.257.749   3   15244004    15897870    gain    Gm9733,Gm9733,Gm9733,Gm9733,Sirpb1a,Sirpb1a,Sirpb1a,Sirpb1a,Sirpb1b,Sirpb1b,Sirpb1b,Sirpb1b,Sirpb1c,Sirpb1c,Sirpb1c,Sirpb1c 16
418.259 3   154861710   155490219   loss    Tnni3k,Tnni3k,Fpgt,Gm26456,Lrriq3   5
2.284.327   4   134885344   137474898   gain    Rhd,Rhd,Tmem50a,D4Wsu53e,Syf2,Runx3,Clic4,Srrm1,Ncmap,Rcan3,Nipal3,Stpg1,Gm25317,Grhl3,Gm23106,Ifnlr1,Il22ra1,Myom3,Srsf10,Pnrc2,Pnrc2,Cnr2,Fuca1,Hmgcl,Gale,Lypla2,Pithd1,Tceb3,Rpl11,Gm26001,Id3,E2f2,Asap3,Tcea3,Zfp46,Hnrnpr,Htr1d,Luzp1,Kdm1a,4930549C01Rik,Lactbl1,Ephb2,C1qb,C1qc,C1qa,Epha8,Zbtb40,Gm23834,Gm23834,Wnt4,Cdc42,Gm13011,Gm13011,Cela3b,Cela3b,Hspg2   56
1.017.899   4   108176679   108417038   gain    Echdc2,Zyg11a,Zyg11b,Selrc1,Fam159a,Gpx7    6
2.229.929   4   80406963    83998058    gain    Tyrp1,Lurap1l,Mpdz,n-R5s187,Nfib,Zdhhc21,Cer1,Frem1,Ttc39b,Gm23412,Snapc3,Psip1,Ccdc171,Gm25899,Gm25899 15
279.458 4   110534756   110628705   gain    Agbl4   1
1.103.167   4   121565222   124833802   gain    Ppt1,Cap1,Mfsd2a,Mycl,Trit1,Bmp8b,Bmp8b,Oxct2b,Ppie,Hpcal4,Nt5c1a,Heyl,Pabpc4,Gm25788,Gm22154,Bmp8a,Bmp8a,Oxct2a,Macf1,Ndufs5,Akirin1,Rhbdl2,Mycbp,Rragc,Gm22983,Pou3f1,Utp11l,Gm24480,Fhl3,Sf3a3,Inpp5b,Mtf1,n-R5s192  33
1.781.441   4   139917291   140083763   loss    Klhdc7a,Igsf21  2
6.829.744   6   147086557   147179673   gain    Mansc4,Klhl42   2
1.070.905   6   63350920    64077379    loss    Grid2   1
3.132.886   7   17188025    18205037    gain    Psg29,Ceacam5,Ceacam14,Gm5155,Ceacam11,Ceacam13,Ceacam12,Igfl3,Igfl3    9
591.926 7   26773232    26976928    gain    Cyp2a5,Cyp2a5,Cyp2a5,Cyp2a22,Cyp2a22,Cyp2a22    6
4.170.656   7   20654493    24128503    gain    Nlrp4e,Nlrp5,Gm10175,Zfp180,Zfp112  5
2.494.001   7   38898625    38991306    loss    Gm21142,Gm25671 2
13.222.294  7   67330026    67943164    loss    Mef2a,Lrrc28,Gm23233,Ttc23,Synm 5
1.330.269   7   7171339 10865583    loss    Zfp418,Clcn4-2,Zik1,Nlrp4b  4
3.414.431   8   49942996    51497632    loss    Gm23986 1
3.059.542   9   21959210    22072123    gain    Epor,Rgl3,Ccdc151,Prkcsh,Elavl3,Zfp653  6
5.277.845   10  80335500    80575991    gain    Reep6,Adamtsl5,Plk5,Mex3d,Mbd3,Uqcr11,Uqcr11,Tcf3,Gm25044,Gm25044,Gm25044,Gm25044,Onecut3,Atp8b3,Rexo1,Klf16    16
26.812.338  10  100597718   100692256   loss    1700017N19Rik   1
6.998.267   11  60393963    60504695    gain    Lrrc48,Atpaf2,Gid4,Drg2,Myo15   5
2.624.723   11  75676344    76212635    gain    Crk,Ywhae,Doc2b,Rph3al,1700016K19Rik,Fam101b,Vps53,Glod4,Fam57a,Gemin4  10
11.851.916  11  97742687    97853778    gain    Pip4k2b,Cwc25,1700001P01Rik,Rpl23,SNORA21,Snora21,Lasp1 7
3.553.325   11  74899198    75121318    loss    Tsr1,Srr,Smg6,Gm22733   4
309.751 11  105624215   107309569   loss    Tanc2,Cyb561,Ace,Ace,Kcnh6,Dcaf7,Taco1,Map3k3,Limd2,Strada,Ccdc47,Ddx42,Ftsj3,Psmc5,Gm23645,Smarcd2,Tcam1,Gh,Gh,Gh,Gh,Gh,Cd79b,Scn4a,2310007L24Rik,Icam2,Ern1,Snord104,Gm22711,Tex2,Milr1,Gm25889,Polg2,Ddx5,Cep95,Smurf2,Bptf,Nol11,Pitpnc1    39
2.642.471   11  30118384    30155192    loss    Sptbn1  1
10.304.184  12  114641806   116183315   gain    Ighv1-73,Ighv1-83,Zfp386,Zfp386,Zfp386,Zfp386,Zfp386,Zfp386,Zfp386,Zfp386,Vipr2 11
1.414.343   12  116239354   117192837   loss    Wdr60,Esyt2,Ncapg2,Gm25112,Gm24354,Ptprn2   6
2.875.469   14  10676764    10768859    loss    Fhit    1
7.743.121   14  52237972    52331429    loss    Rab2b,Gm23758,Tox4,Mettl3,Sall2 5
2.689.596   14  43932587    45325020    loss    Ang5,Ang6,Ear2,Ear2,Ptgdr,Ptger2,Txndc16,Gpr137c,Ero1l  9
1.912.962   14  119385279   119496386   loss    Hs6st3  1
950.029 14  118589508   118681878   loss    Abcc4   1
4.105.345   14  3004822 8437757 loss    Flnb,Dnase1l3,Abhd6,Rpp14,Rpp14,Pxk,Pdhb,Kctd6,Acox2,Fam107a,Oit1,4930452B06Rik 12
1.870.555   16  33446020    33668062    loss    Zfp148,Slc12a8  2
3.148.258   17  5087550 8333690 gain    Arid1b,Tmem242,Zdhhc14,Snx9,Synj2,Serac1,Gtf2h5,Tulp4,n-R5s26,Tmem181a,Dynlt1a,Dynlt1b,Tmem181b-ps,Tmem181b-ps,Dynlt1c,Tmem181c-ps,Dynlt1f,Sytl3,Ezr,Rsph3b,Tagap1,Rnaset2b,Rnaset2b,Gm25119,Rps6ka2,Ttll2,Gm9992,Gm26057,Fndc1,Tagap,Rsph3a,Gm22416,Rnaset2a,Rnaset2a,Fgfr1op,Ccr6,Mpc1,Sft2d1 38
50.819.398  17  40052632    40331607    gain    Gm7148,Pgk2,Crisp3,Crisp1   4
4.099.936   17  14074943    15508274    loss    Dact2,Smoc2,Thbs2,Gm23352,Wdr27,1600012H06Rik,Phf10,Gm3417,9030025P20Rik,Gm3448,Gm3435,Tcte3,Ermard,Dll1,Fam120b,Psmb1,Tbp  17
12.022.555  17  30590875    31053645    loss    Glo1,Dnah8,Gm24661,Gm24661,Gm24661,Gm24661,Gm24661,Gm24661,Gm24661,Gm24661,Glp1r,Umodl1 12
5.135.466   17  36160573    36277761    loss    Gm22453,Rpp21,Trim39    3
4.254.769   17  27372278    27593833    loss    Grm4,Hmga1,Nudt3    3
5.565.997   18  87905985    87999255    loss    Gm24987,Gm24987 2

df2 =

Recursive_level logp    chr start   end CNA Genes   No.of.Gene
1   1.416.541   1   68580000    68640000    loss    Erbb4   1
1   7.876.897   1   173840000   174010000   loss    Mndal,Mnda,Ifi203,Ifi202b   4
1   6.280.751   1   173500000   173660000   loss    BC094916,Pydc4,Pyhin1   3
1   7.369.317   1   115900000   116280000   loss    Cntnap5a    1
2   128.766 2   146170000   146660000   gain    4930529M08Rik,Insm1,Ralgapa2    3
1   5.777.222   2   76720000    76800000    loss    Ttn 1
2   1.448.913   3   15360000    16000000    loss    Sirpb1a,Sirpb1a,Sirpb1a,Sirpb1a,Sirpb1b,Sirpb1b,Sirpb1b,Sirpb1b,Sirpb1c,Sirpb1c,Sirpb1c,Sirpb1c 12
1   3.845.977   4   119500000   125160000   gain    AA415398,AA415398,AA415398,AA415398,AA415398,Foxj3,Guca2a,Guca2b,Hivep3,Edn2,Foxo6,Scmh1,Slfnl1,Ctps,Cited4,Kcnq4,Nfyc,Mir30c-1,Mir30e,Rims3,Exo5,Zfp69,Smap2,Col9a2,Zmpste24,Tmco2,Rlf,Ppt1,Cap1,Mfsd2a,Mycl,Trit1,Bmp8b,Bmp8b,Oxct2b,Ppie,Hpcal4,Nt5c1a,Heyl,Pabpc4,Bmp8a,Bmp8a,Oxct2a,Macf1,Ndufs5,Akirin1,Rhbdl2,Mycbp,Rragc,Pou3f1,Utp11l,Fhl3,Sf3a3,Inpp5b,Mtf1,n-R5s192,1110065P20Rik,Yrdc,Maneal,Cdca8,Rspo1,Gnl2,Dnali1,Snip1,Meaf6,Zc3h12a    66
1   1.446.699   4   73900000    74180000    gain    Frmd3   1
1   2.262.305   4   72740000    72880000    gain    Aldoart1    1
1   1.234.215   4   80820000    84340000    gain    Tyrp1,Lurap1l,Mpdz,n-R5s187,Nfib,Zdhhc21,Cer1,Frem1,Ttc39b,Snapc3,Psip1,Ccdc171,Bnc2    13
1   123.671 4   108480000   108760000   gain    Zcchc11,Prpf38a,Orc1,Cc2d1b,Zfyve9  5
1   1.418.261   4   139400000   147600000   loss    Ubr4,Iffo2,Aldh4a1,Tas1r2,Pax7,Klhdc7a,Igsf21,Arhgef10l,Rcc2,Padi4,Padi3,Padi1,Padi2,Sdhb,Atp13a2,Mfap2,Crocc,Necap2,Spata21,Szrd1,Fbxo42,Rsg1,Arhgef19,Epha2,Fam131c,Clcnka,Clcnka,Clcnkb,Clcnkb,Hspb7,Zbtb17,Spen,Fblim1,Tmem82,Slc25a34,Plekhm2,Ddi2,Rsc1a1,Agmat,Dnajc16,Casp9,Cela2a,Cela2a,Ctrc,Efhd2,Fhad1,Tmem51,Kazn,Prdm2,Pdpn,Lrrc38,1700012P22Rik,Aadacl3,9430007A20Rik,Dhrs3,Vps13d,Tnfrsf1b,Tnfrsf8,Zfp600,Zfp600,Rex2    61
1   8.113.817   6   129740000   129800000   gain    Klri2   1
1   15.569.108  6   41360000    41480000    loss    Prss3,Prss3,Prss1,Prss1 4
1   2.037.683   6   63480000    63700000    loss    Grid2   1
2   14.694  7   38260000    38280000    gain    Pop4    1
1   14.946  7   35780000    38280000    gain    Zfp507,Tshz3,Zfp536,Uri1,Ccne1,1600014C10Rik,Plekhf1,Pop4   8
1   7.192.011   7   47500000    47620000    loss    Mrgpra2b,Mrgpra3    2
1   1.722.108   7   26000000    26200000    loss    Cyp2b13,Cyp2b9  2
1   12.683.495  7   11350000    11680000    loss    Zscan4f 1
1   1.360.954   10  80900000    81100000    gain    Timm13,Lmnb2,Gadd45b,Gng7,Diras1,Slc39a3,Sgta,Thop1,Creb3l3 9
1   267.959 11  97880000    98000000    gain    Fbxo47,Plxdc1,Arl5c 3
1   1.872.174   11  75860000    76420000    gain    Rph3al,1700016K19Rik,Fam101b,Vps53,Glod4,Fam57a,Gemin4,Rnmtl1,Nxn,Timm22,Abr    11
1   2.811.352   12  113560000   114920000   gain    Ighv14-3,Ighv13-1,Ighv13-1,Ighv13-1,Ighv13-1,Ighv13-1,Ighv6-4,Ighv6-4,Ighv6-4,Ighv6-4,Ighv6-4,Ighv6-5,Ighv6-5,Ighv6-5,Ighv6-5,Ighv6-5   16
1   1.979.667   12  115860000   115980000   loss    Ighv1-83    1
1   2.098.521   12  17420000    21160000    loss    Nol10,Odc1,Hpcal1,5730507C01Rik,Asap2   5
1   21.864.853  13  12580000    12650000    loss    Ero1lb  1
1   3.233.185   13  61500000    62780000    loss    Ctsm,Cts3,Zfp808    3
1   5.640.895   14  53540000    53780000    gain    Trav12-2,Trav12-3,Trav13-2,Trav14-2,Trav15-2-dv6-2,Trav3-3,Trav9-4,Trav9-4,Trav9-4,Trav9-4,Trav4-4-dv10,Trav5-4,Trav6-7-dv9,Trav7-6,Trav7-6,Trav7-6,Trav16,Trav13-4-dv7,Trav14-3,Trav3-4    20
1   2.942.081   14  86300000    97240000    gain    Diap3,Tdrd3,Rps3a2,Pcdh20,Pcdh9,Klhl1   6
1   4.662.806   14  9840000 9880000 loss    Fhit    1
1   3.638.346   14  43740000    44640000    loss    Ear1,Ear1,Ear10,Ear10,Ang5,Ang6,Ear2,Ear2   8
1   1.709.546   14  35320000    37400000    loss    Grid1,n-R5s46,Ccser2,Rgr,Lrit1,Lrit2,Cdhr1,2610528A11Rik,Ghitm  9
2   3.387.282   14  84060000    85740000    loss    Pcdh17  1
1   2.140.909   14  68280000    86300000    loss    Adam7,Adamdec1,Adam28,Stc1,Nkx2-6,Nkx3-1,Slc25a37,Synb,Entpd4,SYNB,Loxl2,R3hcc1,Chmp7,Tnfrsf10b,Tnfrsf10b,Tnfrsf10b,Tnfrsf10b,Rhobtb2,Pebp4,Egr3,Bin3,Ccar2,9930012K11Rik,9930012K11Rik,Pdlim2,Sorbs3,Ppp3cc,Slc39a14,Piwil2,Polr3d,Mir320,Phyhip,Bmp1,Sftpc,Lgi3,Reep4,Hr,Nudt18,Fam160b2,Dmtn,Fgf17,Npm2,Xpo7,Dok2,Gfra2,Fndc3a,Cysltr2,Rcbtb2,Rb1,Lpar6,Itm2b,Med4,Nudt15,Sucla2,Htr2a,Esd,Lrch1,5031414D18Rik,Lrrc63,Lcp1,Cpb2,Zc3h13,Siah3,Spert,Cog3,Slc25a30,Tpt1,Snora31,Gtf2f2,Kctd4,Gpalpp1,Nufip1,Rps2-ps6,Tsc22d1,Serp2,Lacc1,Ccdc122,Enox1,n-R5s48,Dnajc15,Epsti1,Fam216b,Tnfsf11,Akap11,Dgkh,Vwa8,Zfp957,Rgcc,Naa16,Mtrf1,Kbtbd7,Zbtbd6,Wbp4,Elf1,Sugt1,Lect1,Pcdh8,Olfm4,Pcdh17  99
1   3.810.267   14  109680000   111240000   loss    n-R5s50,Slitrk6 2
1   3.924.724   15  77460000    77560000    loss    Apol10a,Apol10a,Apol10a,Apol10a,Apol11a,Apol11a,Apol11a,Apol11a,Apol7c  9
1   7.728.161   16  44780000    44920000    gain    Cd200r1,Cd200r1,Cd200r4,Cd200r4,Cd200r2,Cd200r2 6
1   348.511 17  73500000    76640000    gain    Galnt14,Ehd3,Xdh,Memo1,Dpy30,Spast,Slc30a6,Nlrc4,Yipf4,Birc6,Ttc27,Ltbp1,Rasgrp3,Fam98a 14
1   1.052.043   17  36120000    36540000    gain    Rpp21,Trim39    2
1   1.325.386   17  90420000    90540000    loss    Nrxn1   1
1   4.438.061   17  38300000    38360000    loss    Olfr137,Olfr137 2
1   125.062 17  30380000    30920000    loss    Btbd9,Glo1,Dnah8,Glp1r  4
1   2.998.359   19  13860000    13900000    gain    Olfr1502    1
2   3.307.524   19  30910000    30970000    loss    Prkg1   1

When i tried df2[mapply(function(x, y) length(intersect(x,y))>0, strsplit(df1$Gene, ','), strsplit(df2$Gene, ',')),]

i got out

logp chr    start      end  CNA    Genes No.of.genes
39 2.689.596  14 43932587 45325020 loss Ang5,Ang6,Ear2,Ear2,Ptgdr,Ptger2,Txndc16,Gpr137c,Ero1l           9

But i can find many rows with at least one common Gene

1
Could you show the expected result. Pekg is common for both df1 and df2. So, I guess that would be included. But, in the 3rd and 4th rows, in addition to Sox1, there are other genes also in df1akrun
I guess the OP wants to extract rows with at least one common Gene, otherwise it would have been easy for the OP to do.codingEnthusiast
@naltiper you are correctbeginner
@beginner You could try strsplit to split up the Gene column` and then do the match. Perhaps df2[mapply(function(x,y) any(x %in% y), strsplit(df1$Gene, ','), strsplit(df2$Gene, ",")),]akrun
@akrun Add it as answer, I think it covers the OP completely.codingEnthusiast

1 Answers

3
votes

We could split up the "Genes" column in each datasets with strsplit, then compare the corresponding list elements with mapply, check if there is any intersect and use that index to subset the "df2"

 df2[mapply(function(x,y) any(x %in% y),
         strsplit(df1$Gene, ','), strsplit(df2$Gene, ",")),]
 #   chr start  end                Gene
 #1 1179  3360 gain Recl,Bcl,Trim3,Pop4
 #3 7180  9229 loss                Sox1
 #4 8159  8360 loss                Sox1
 #5 9154 10588 loss                Pekg

Or use intersect and length

  df2[mapply(function(x, y)  length(intersect(x,y))>0,
       strsplit(df1$Gene, ','), strsplit(df2$Gene, ',')),]

Update

If we need to find whether a single "Gene" of first dataset is found in any of the rows of second data (using the updated dataset)

df2[sapply(strsplit(df2$Gene, ','), function(x) 
  any(sapply(strsplit(df1$Gene, ','), function(y) any(x %in% y)))),]