4
votes

I am trying to arrange the labels of my ggplot scatterplot so that they don't overlap one another. For this purpose, I am trying to use the directlabels package, but I cannot get it to work. When I tried the code:

mytable <- read.csv('http://www.fileden.com/files/2012/12/10/3375236/My%20Documents/CF1_deNovoAssembly.csv', sep=",",  header=TRUE)

mytable$Consensus.length <- log(mytable$Consensus.length)

mytable$Average.coverage <-log(mytable$Average.coverage)

mytable$Name <- do.call(rbind,strsplit(as.character(mytable$Name), " ", '['))[,3]

library(ggplot2)
library(directlabels)
p <- ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + geom_point() + ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + ggtitle("Contig Coverage vs Length") + geom_text(hjust=0, vjust=-0.2, size=4)
direct.label(p, "first.qp")

I got this error:

Error in direct.label.ggplot(p, "first.qp") : 
  Need colour aesthetic to infer default direct labels.

So I changed the plotting script by adding a colour aesthetic to geom_point():

p <- ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + geom_point(aes(colour=Average.coverage)) + ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + ggtitle("Contig Coverage vs Length") + geom_text(hjust=0, vjust=-0.2, size=4)

And now I get the following error:

Error in order.labels(d) : labels are not aligned

I found this thread, which suggests placing the labels manually when there are only a few data points, or omitting them entirely when there are too many. I agree, but I will be generating this graph for many different data sets and I do need the data labels. This is how the graph looks so far: [image]

2
Are the differences between the labels (172 and 165, say) meaningful? I'm asking because you could use a colour scale based on a cut of these numbers, breaking them into groups of 10 or 20, for example. That would make sense if they represent a geography or some other measurable distance. - Brandon Bertelsen
Another step might be to remove the points and plot only the numbers (in which case you will want to set hjust and vjust to 0.5). But I think there is ultimately no way to have all of the labels present, non-overlapping, and at a large font size: too many of your data points are too close to one another. - Drew Steen
@BrandonBertelsen the differences are not meaningful per se, but I would like to know where 172 and 165 cluster. For instance, I would like to identify which data points cluster in the group of data points between 4.5 and 5.5 in the y axis. - Julio Diaz
@DrewSteen that is an interesting option. Could you please advise me on how to accomplish that? - Julio Diaz
I am encountering an identical problem. - MartinT

2 Answers

2
votes

From your comments, it sounds a bit more like a clustering exercise. So, let's go ahead and actually do so:

set.seed(9234970)
d <- data.frame(Name=mytable$Name, 
x=mytable$Consensus.length, 
y=mytable$Average.coverage)
d$kmeans <- as.factor(kmeans(d[-1],20)$cluster)
ggplot(d, aes(x, y, color=kmeans)) + 
geom_point() + 
theme(legend.position="bottom")

[Figure: kmeans clusters]

ggplot(d, aes(x, y, label=Name)) + 
geom_text() + 
facet_wrap(~kmeans, scales="free")

[Figure: Cluster Breakout]

I chose 20 clusters arbitrarily.
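If an arbitrary cluster count feels unsatisfying, one common heuristic is an elbow plot of the total within-cluster sum of squares. The following is only a rough sketch on synthetic data: the `pts` data frame and the candidate range of 1 to 10 clusters are assumptions made for illustration and do not come from the question's file.

```r
set.seed(1)

# Synthetic stand-in for the two log-transformed columns (assumption):
# two well-separated groups of 50 points each.
pts <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5)),
                  y = c(rnorm(50, 0), rnorm(50, 5)))

# Candidate numbers of clusters to compare (assumed range)
ks <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(pts, centers = k, nstart = 10)$tot.withinss
})

# Look for the "elbow" where adding clusters stops helping much
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With real data you would pass the same two numeric columns that go into kmeans() above, for example d[c("x", "y")], in place of `pts`.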

You could also use hierarchical clustering to see a dendrogram.

plot(hclust(dist(d[c("x", "y")]))) # use only the numeric x and y columns

I'd recommend playing around with the cluster package in general as it may provide a more useful solution to your problem.

3
votes

You could simply remove the points and plot only the labels, which can be accomplished by commenting out the geom_point() part of your plot. (You'll want to change the hjust and vjust values to 0.5, also, so that the center of the label appears where the point would be):

ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + 
  #geom_point() + 
  ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + 
  ggtitle("Contig Coverage vs Length") + geom_text(hjust=0.5, vjust=0.5, size=4)

There's still some overlap, but perhaps by adjusting the size of the font and the plot it won't be too serious.
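If dropping some labels is acceptable, recent versions of ggplot2 also let geom_text() skip any label that would overlap one already drawn, via its check_overlap argument. This is a different trade-off from the approach above, since not every point ends up labelled. Below is a minimal sketch on synthetic data; the `demo` data frame is an assumption made for illustration, not the question's data.

```r
library(ggplot2)

set.seed(42)

# Synthetic stand-in for the contig table (assumption): 200 labelled points
demo <- data.frame(x = rnorm(200),
                   y = rnorm(200),
                   Name = as.character(1:200))

# check_overlap = TRUE silently drops labels that would overlap
# labels drawn earlier in the same layer
p <- ggplot(demo, aes(x = x, y = y, label = Name)) +
  geom_text(size = 4, check_overlap = TRUE)

print(p)
```

Labels are drawn in data order, so sorting the data frame so that the most important rows come first should give those labels priority.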
