A way of testing a set of genomic locations for exon/intron/utr?

Question

I would like to test a bunch of genomic locations of the form:

chr4:154723876-154724615
chr6:139580853-139581090
chr18:30440532-30441569

I want to see whether each one is located in a UTR, an intron, an exon or an intergenic sequence. I don't need to know which gene's introns (etc.) the coordinates fall in.

I assume that each known genetic element (such as an exon) has a defined genomic location (start and end position on a chromosome). I know this is true for exons and introns, since Ensembl, for example, has IDs for each exon in the genome: see the example of exons and introns of the Amy1 gene in Mus musculus. I want to query a database of such locations with my list of locations above, and if the two overlap (ideally I could specify the minimum overlap, say at least 10 bp, but it's OK if not), I should get a hit (yes, this region is in an exon/intron/UTR).

The catch is that I have a few thousand of these locations and would ideally like to query them all in one go, with the output being a table in which each location is assigned "intron/exon/utr/intergenic". The organism is Mus musculus and the locations are from across the genome.

I cannot for now provide a code sample of what I am trying to do because I don't know where to start - if I had a package or anything to build upon it would help me find the solution.

Would be perfect if I could do it in R, but AFAIK I can't do it in biomaRt and I couldn't find a package to do it. I thought of Galaxy, but given their non-trivial way of doing it and strange output they produce I would rather stick to R. The devil you know etc.

Help would be much appreciated.

Hi. Please post a sample of your data, along with examples of what it means to be "located in a UTR" etc. In addition, even if you don't have R code, post the algorithm you're trying to code up so we can see you've tried something. — Carl Witthoft
Hi, thanks. I edited a lot to make it clearer. Hope it's better now. — yotiao
Carl, this is not about an algorithm, you can't decide that kind of thing with an algorithm!! This is about a package that would do queries in an appropriate database. I am interested in an answer as well. — Elvis
You should add 'bioinformatics' and maybe 'bioconductor' to the tags. 'mapping' doesn't seem appropriate. — Elvis

yotiao · Accepted Answer · 2014-04-15T12:22:39

OK, sorry it took me so long, but the paper is submitted and the way I did it finally was to:

1) Download the lists of genomic coordinates for whole genes, exons, introns and the so-called 3'-UTR exons and 5'-UTR exons from the UCSC Table Browser, using the Ensembl gene annotation. The only finicky bit is that you have to download the file for whole genes separately from the rest, and the manual does not explicitly state what "whole gene" is. But if you paste the coordinates it produces into the Genome Browser you can see that it covers the 5' UTR, all introns and exons, and the 3' UTR.

2) Use the BEDtools package (Quinlan and Hall 2010, https://www.ncbi.nlm.nih.gov/pubmed/20110278); a very nice manual with simple examples is here: http://bedtools.readthedocs.org/en/latest/. I used the intersect command with the -f flag, which sets the minimum overlap required between my coordinates and the UCSC ones (expressed as a fraction of the length of each of my intervals).
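BEDtools works on BED files, so locations written as chr:start-end have to be converted first. Below is a minimal Python sketch of that conversion. It assumes the locations are 1-based and inclusive (as the Genome Browser position box shows them), whereas BED is 0-based and half-open. The function name position_to_bed is made up for this illustration.

```python
import re

# Browser-style positions such as "chr4:154723876-154724615".
POSITION = re.compile(r"^(chr[\w.]+):(\d+)-(\d+)$")


def position_to_bed(position):
    """Convert 'chrom:start-end' (assumed 1-based, inclusive) to a BED line."""
    cleaned = position.strip().replace(",", "")
    match = POSITION.match(cleaned)
    if match is None:
        raise ValueError(f"unrecognised position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    # BED is 0-based and half-open, so only the start shifts by one.
    # The original string is kept as the name column to trace hits back.
    return f"{chrom}\t{start - 1}\t{end}\t{cleaned}"


positions = [
    "chr4:154723876-154724615",
    "chr6:139580853-139581090",
    "chr18:30440532-30441569",
]
bed_lines = [position_to_bed(p) for p in positions]
print("\n".join(bed_lines))
```

Writing these lines to a file gives the query file that is passed to the intersect command.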

It worked like a charm - I got a tabulated file with overlaps of each feature. Hope this helps.
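For anyone who wants to see the logic rather than the tool, here is a rough pure-Python sketch of the same idea: label each query region with every feature type it overlaps by at least a given number of bp, and call it intergenic otherwise. This is not what BEDtools does internally, and it is not the exact procedure I ran. The feature coordinates are made up for illustration and are not real mouse annotation, and the names overlap_bp and annotate are hypothetical.

```python
from collections import defaultdict


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared bases between two 0-based, half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def annotate(queries, features, min_overlap=10):
    """Map each query to the sorted feature labels it overlaps.

    A query with no overlap of at least min_overlap bp is 'intergenic'.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, label in features:
        by_chrom[chrom].append((start, end, label))

    result = {}
    for chrom, q_start, q_end in queries:
        # Linear scan per chromosome: fine for a sketch, slow for a genome.
        labels = {
            label
            for start, end, label in by_chrom[chrom]
            if overlap_bp(q_start, q_end, start, end) >= min_overlap
        }
        result[(chrom, q_start, q_end)] = sorted(labels) or ["intergenic"]
    return result


# Made-up feature coordinates, for illustration only.
features = [
    ("chr4", 154723000, 154723900, "exon"),
    ("chr4", 154723900, 154725000, "intron"),
    ("chr6", 139580000, 139580860, "3'-UTR"),
]
# Query regions in BED-style (0-based, half-open) coordinates.
queries = [
    ("chr4", 154723875, 154724615),
    ("chr6", 139580852, 139581090),
    ("chr18", 30440531, 30441569),
]

for region, labels in annotate(queries, features).items():
    print(region, labels)
```

With these toy features the chr4 region is labelled both exon and intron, while the chr6 region overlaps the UTR by only 8 bp and so falls below the 10 bp threshold. A region can legitimately get more than one label, so decide up front whether you want all of them or a priority order.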
