5
votes

I'm trying to label individual points of interest in a specific scatter plot in ggplot2. My data exists as a csv file with multiple columns.

Gene       chr    start    stop      A      B       C       D      E
APOBEC3G   chr22  39472992 39483773  97.06  214.56  102.34  20.00  19.45  
APOBEC3C ... 

And so on and so forth. I am trying to plot column A v. column B via ggplot and I'm successful and can label all of the points with the corresponding gene name. However, how do I highlight (i.e. color, size change) individual genes of interest? (AKA: How do I make the data point for a list of 10 genes that I have on hand stand out? Or how can I annotate my genes of interest on the scatterplot without annotating all other points?)

I've tried using the subset function, but as an R novice I got stuck.

1
Just adjust your conditional to a point rather than a range. - Shawn Mehan

1 Answer

9
votes

You need to create a new variable that distinguishes the observations you want to highlight.

Let's simulate a data.frame:

df <- data.frame(genes=letters,
                 A=runif(26),
                 B=runif(26))

Your current plot should look like this (point + labels):

library(ggplot2)

ggplot(data=df,aes(x=A,y=B,label=genes)) +
  geom_point() +
  geom_text(hjust=-1,vjust=1)

In order to highlight some genes, we create a new variable, group. I assign "important" to some arbitrary genes. You may want to do this programmatically, by looking for outliers for instance.

df$group <- "not important"
df$group[df$genes %in% c("d","g","b")] <- "important"
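If the genes of interest are already stored in a vector, the same grouping can be written in one step with ifelse(). This is only a minimal sketch: genes_of_interest is a hypothetical name, and the simulated df stands in for the real data, so substitute your own gene names and column.

```r
# Hypothetical vector of genes to highlight; replace with your own list.
genes_of_interest <- c("d", "g", "b")

# Simulated data standing in for the real table.
df <- data.frame(genes = letters,
                 A = runif(26),
                 B = runif(26))

# Flag each row according to whether its gene is in the list.
df$group <- ifelse(df$genes %in% genes_of_interest,
                   "important", "not important")

# Quick check of how many rows fall in each group.
table(df$group)
```

Checking the counts with table() is a cheap way to catch gene names that were misspelled and therefore never matched.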

Now, there are two ways to separate the genes. The most idiomatic is to give each group its own colour (or shape, or size, etc.): one for important genes, one for unimportant ones. This is easily achieved by mapping the new variable to colour (or size, shape, etc.):

ggplot(data=df,aes(x=A,y=B,label=genes)) +
  geom_point(aes(color=group)) +
  geom_text(hjust=-1,vjust=1)
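With this mapping ggplot2 picks the two colours itself. If you would rather choose them, one option is scale_color_manual() with a named vector. The sketch below rebuilds the simulated data so that it runs on its own; the specific colours are an arbitrary choice.

```r
library(ggplot2)

# Simulated data and grouping, as above.
df <- data.frame(genes = letters, A = runif(26), B = runif(26))
df$group <- "not important"
df$group[df$genes %in% c("d", "g", "b")] <- "important"

p <- ggplot(data = df, aes(x = A, y = B, label = genes)) +
  geom_point(aes(color = group)) +
  geom_text(hjust = -1, vjust = 1) +
  # The names of the vector must match the values found in df$group.
  scale_color_manual(values = c("important" = "red",
                                "not important" = "grey60"))
p
```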


However, you could also plot each group on a separate layer to clearly highlight the important genes. In that case, we first add all points, and then add a new geom_point that contains only the important genes, with special attributes (here, color and size).

ggplot(data=df,aes(x=A,y=B,label=genes)) +
  geom_point() +
  geom_point(data=df[df$group == "important",],color="red",size=3) +
  geom_text(hjust=-1,vjust=1)
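The same layering idea can be applied to the labels, which covers the case in the question of annotating only the genes of interest. This is a sketch under the same assumptions as above (simulated data, arbitrary choice of genes); the name important is introduced here purely for illustration.

```r
library(ggplot2)

# Simulated data and grouping, as above.
df <- data.frame(genes = letters, A = runif(26), B = runif(26))
df$group <- "not important"
df$group[df$genes %in% c("d", "g", "b")] <- "important"

# Subset holding only the rows to highlight and label.
important <- df[df$group == "important", ]

p <- ggplot(data = df, aes(x = A, y = B)) +
  geom_point() +
  # Highlighted points are drawn on top of the full set.
  geom_point(data = important, color = "red", size = 3) +
  # Labels come only from the subset, so other points stay unlabelled.
  geom_text(data = important, aes(label = genes), hjust = -1, vjust = 1)
p
```

Because the label aesthetic is set inside geom_text() rather than in the top-level aes(), the unlabelled layers do not need a label column at all.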
