I would like to test a bunch of genomic locations of the form:
chr4:154723876-154724615
chr6:139580853-139581090
chr18:30440532-30441569
I want to see whether each one is located in a UTR, an intron, an exon or an intergenic sequence. I don't need to know which gene's intron (etc.) a location falls in.
I assume that each known genetic element (like an exon) has a defined genomic location (a start and end position on a chromosome). I know this is true for exons and introns, since for example Ensembl has IDs for each exon in the genome: see the example of the exons and introns of the Amy1 gene in Mus musculus. I want to query a database of such locations with my list of locations above, and if there is an overlap between the two (ideally I would be able to specify the minimum overlap, say at least 10 bp, but it's fine if not), I should get a hit (yes, this region is in an exon/intron/etc.).
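To make the overlap criterion concrete, this is roughly what I mean by "at least 10 bp". It is only a toy sketch in base R: the exon coordinates are made up, `overlap_width()` is a name I invented, and I am assuming closed (inclusive) start-end intervals.

```r
# Toy illustration only: the "exon" below is made up, not real annotation.
# overlap_width() is a hypothetical helper, not a function from any package.
overlap_width <- function(q_start, q_end, f_start, f_end) {
  # Number of shared bases between two closed intervals (0 if disjoint)
  pmax(0, pmin(q_end, f_end) - pmax(q_start, f_start) + 1)
}

min_overlap <- 10  # required overlap in bp

# Query chr4:154723876-154724615 against a pretend exon at 154724600-154724900
w   <- overlap_width(154723876, 154724615, 154724600, 154724900)
hit <- w >= min_overlap
```

Here the two intervals share 16 bases, so `hit` is `TRUE`; with a pretend exon starting at 154724610 the shared width would drop to 6 and it would not count.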
The complication is that I have a few thousand of these locations and would ideally like to query them all in one go, with a table as output in which each location is assigned "intron/exon/utr/intergenic". The organism is Mus musculus and the locations are from across the genome.
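The kind of output I am after looks like the result of the sketch below. This is not a real attempt at the problem: the `features` table is a hypothetical stand-in with invented coordinates (I don't know where to get the real annotation from), and `classify()` is just a name I picked for illustration.

```r
# Hypothetical toy annotation: made-up features, NOT real mouse coordinates.
features <- data.frame(
  chr   = c("chr4", "chr4", "chr6"),
  start = c(154724600, 154720000, 139580000),
  end   = c(154724900, 154724599, 139582000),
  type  = c("exon", "intron", "utr"),
  stringsAsFactors = FALSE
)

locations <- c("chr4:154723876-154724615",
               "chr6:139580853-139581090",
               "chr18:30440532-30441569")

# Split each "chr:start-end" string into its three parts
parts <- do.call(rbind, strsplit(locations, "[:-]"))
queries <- data.frame(
  location = locations,
  chr      = parts[, 1],
  start    = as.numeric(parts[, 2]),
  end      = as.numeric(parts[, 3]),
  stringsAsFactors = FALSE
)

# Label one location with every feature type it overlaps by >= min_overlap bp,
# or "intergenic" if it overlaps none (intervals treated as closed)
classify <- function(chr, start, end, features, min_overlap = 10) {
  same_chr <- features[features$chr == chr, ]
  width <- pmin(end, same_chr$end) - pmax(start, same_chr$start) + 1
  hits <- unique(same_chr$type[width >= min_overlap])
  if (length(hits) == 0) "intergenic" else paste(sort(hits), collapse = "/")
}

queries$class <- mapply(classify, queries$chr, queries$start, queries$end,
                        MoreArgs = list(features = features),
                        USE.NAMES = FALSE)
result <- queries[, c("location", "class")]
print(result)
```

With this invented annotation the first location is labelled "exon/intron" because it spans both, the second "utr" and the third "intergenic". A row-by-row loop like this is presumably far too slow and naive for the whole genome, which is why I am looking for a package that does it properly.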
I can't provide a code sample of a real attempt yet because I don't know where to start; a package or anything else to build upon would help me find the solution.
It would be perfect if I could do it in R, but AFAIK I can't do it in biomaRt and I couldn't find a package that does it. I thought of Galaxy, but given its non-trivial way of doing this and the strange output it produces, I would rather stick to R. The devil you know etc.
Help would be much appreciated.
R
code, post the algorithm you're trying to code up so we can see you've tried something. – Carl Witthoft