5
votes

I would like to test a bunch of genomic locations of the form:

chr4:154723876-154724615
chr6:139580853-139581090
chr18:30440532-30441569

I want to see whether they are located in an UTR or intron or exon or an intergenic sequence. I don't care for information about in which genes' introns (etc.) these coordinates are.

I assume that each known genetic element (like an exon) has defined genomic location (start-end position in the genome on each chromosome). I know this is true for exons and introns, as for example Ensembl has IDs for each exon in the genome: see example of exons and introns of Amy1 gene in Mus musclulus. I want to query a database of such locations with the above list of my locations, and if there is an overlap between the two (ideally I should be able to specify the overlap, say, at least 10bp, but if not I am OK), I should get a hit (yes, this region is in the exon/intron/)

And the handicap is that I have a few thousand of these locations and would ideally like to query them in all one go and as an output have a table where each location would be assigned "intron/exon/utr/intergenic". The organism is Mus musculus and the locations are from across the genome.

I cannot for now provide a code sample of what I am trying to do because I don't know where to start - if I had a package or anything to build upon it would help me find the solution.

Would be perfect if I could do it in R, but AFAIK I can't do it in biomaRt and I couldn't find a package to do it. I thought of Galaxy, but given their non-trivial way of doing it and strange output they produce I would rather stick to R. The devil you know etc.

Help would be much appreciated.

4
Hi. Please post a sample of your data, along with examples of what it means to be "located in a UTR" etc. In addition, even if you don't have R code, post the algorithm you're trying to code up so we can see you've tried something.Carl Witthoft
Hi, thanks. I edited a lot to make it clearer. Hope it's better now.yotiao
Carl, this is not about an algorithm, you can’t decide that kind of thing with an algorithm!! This is about a package that would do queries in an appropriate database. I am interested by an answer as well.Elvis
You hould ass 'bionformatics' and maybe 'bioconductor' in the tags. 'mapping' doesn’t seem appropriate.Elvis
More appropriate on biostars.orgArun

4 Answers

1
votes

OK, sorry it took me so long, but the paper is submitted and the way I did it finally was to:

1) Download the list of genomic coordinates for whole genes, exons, introns and so-called 3'-UTR exons and 5'-UTR exons from UCSC table browser using Ensembl gene annotation. The only finicky bit is that you have to download a file for whole genes and the rest separately, and the manual does not explicitly state what "whole gene" is. But if you paste the coordinates it produces into Genome Browser you could see it is 5' UTR, all introns and axons and 3' UTR.

2) Use BEDtools package (Quinlan and Hall 2010, https://www.ncbi.nlm.nih.gov/pubmed/20110278), a very nice manual with simple examples is here: http://bedtools.readthedocs.org/en/latest/ and used the intersect command with -f flag that let me set a minimum overlap (in bp or in %) between my coordinates and the UCSC one.

It worked like a charm - I got a tabulated file with overlaps of each feature. Hope this helps.

0
votes

NCBI has a chromosome map viewer

http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?TAXID=9606&CHR=4&MAPS=ideogr,morbid[11164.00%3A11170.00]&QSTR=EVC%20OR%20HD%20OR%20FGFR3%20OR%20SNCA%20OR%20NRCLP%20OR%20FOP&QUERY=uid(1968,2105,2886,6280,13348,20241,9026199,9026201,9026283,9026440,9027752,9027884)&zoom=100

on the left you have two search boxes that says region show.

0
votes

This is not a complete answer, but I hope this helps.

The bioconductor package BSgenome.Mmusculus.UCSC.mm10 contains the (last assembly of) the mouse sequence. These two lectures (1 and 2) show how to use this kind of package. It seems that you can retrieve exons and introns with the bioconductor package GenomicFeatures which interacts with the UCSC database.

I think this is a good direction to look at. I don’t have time to find more information, please share what you will find.

0
votes

as always there are several ways of doing things, but one really fast is using HOMER annotatePeaks.pl script. You just need to download the HOMER scripts and call: annotatePeaks.pl.

your_bed_file genome > your_output_file.

You can find the documentation in link Just be careful and format your bed file (what you called "genomic locations" file) following the columns structure that HOMER recognize. In the output file, you will get a column called "annotation" and another called "detailed annotation" where you will find the information of in which genomic region you find that coordinates.

(intron, exons, 5'UTR,3'UTR, intergenic, non-coding, GC islands...)

Is the faster way but some of the genes match that you can get using Bedtools maybe are not found by HOMER because the annotation does not exist. This has happened to me, in this case, what I did was, using R biomaRt, download the info of these genes that were found by bedtools and without HOMER annotation, and annotate them with the TSS, 5'UTR,exons starts and ends and then as you have the exact location of your peak r binding position with a "for if" statement you can find the genomic region.