5
votes

I have a set of data that is tough to visualize, but I think an ECDF with a couple of points and lines added to it will do the trick. I am able to plot things the way that I want; my problem is coloring things correctly.

I have the following code, which puts all of the right lines and points on the plot, but now I would like to properly color and label everything. I've pored over multiple articles and tried a hundred things, but can't get it right. Do I need to format my data differently?

My vision for the legend is something like this:

  • dashed line = b
  • solid line = a
  • red = s
  • blue = d
  • dot = s.mean

Code for generating an example plot:

require(ggplot2)
require(reshape2)

s.a = rnorm(100)*100
s.b = rnorm(100)*100+50
d.a = -35
d.b = 20
sdata = data.frame(cbind(s.a,s.b))
ddata = data.frame(cbind(d.a,d.b))
sdata.m = melt(sdata)
ddata.m = melt(ddata)

ggplot(sdata.m, aes(x=value, color=variable)) +
  geom_vline(data=ddata.m,
             aes(xintercept = value,
                 color=variable),
             linetype = 2,
             size=2) + 
  stat_ecdf(size=1)+
  labs(title = 'plotTitle',
       color='colorLegendTitle') +
  xlab('xLabel') +
  ylab('yLabel')+
  theme_bw(30) +
  theme(
    legend.position=c(.8, .2),
    legend.box="horizontal",
    text=element_text(family="Times"),
    legend.key.size = unit(1,"cm")) +
  geom_point(x=mean(sdata.m$value[sdata.m$variable=="s.a"]),y=.5,
             size = 5) +
  geom_point(x=mean(sdata.m$value[sdata.m$variable=="s.b"]),y=.5,
             size = 5)

Some context on the data I'm plotting: I have stochastic datasets (s) and deterministic sets (d); each stochastic set will have hundreds of values, while the deterministic sets only have a single value. So in my plot, I'm comparing the distribution of stochastic data (solid lines), and the mean of stochastic data (dots) with the deterministic values (dashed lines). For both the stochastic and deterministic datasets, there are two 'cases' (a) and (b). I would like all (a) and (b) data to share the same color.

This seems like it should be easy with aes and color/linetype/geom mappings, but I can't figure it out.

Thanks in advance.

2
So in the chart above, you want d.a and s.a to be the same colour and d.b and s.b to be the same colour? - SlowLearner

2 Answers

4
votes

To get a better legend, place color=variable and linetype=variable inside aes() in both ggplot() and geom_vline(), so that there is a single legend. Then, for geom_point(), place x and y inside aes() along with color="s.mean" and linetype="s.mean"; this ensures that a new level is added to the legend. With scale_color_manual() and scale_linetype_manual() you can then set the desired colors and linetypes, and with guides() and override.aes= you can remove the points from the first four legend entries.

ggplot(sdata.m, aes(x=value, color=variable,linetype=variable))+
  stat_ecdf(size=1)+
  geom_vline(data=ddata.m,
             aes(xintercept = value,color=variable,linetype=variable),
             size=2) +
  geom_point(aes(x=mean(sdata.m$value[sdata.m$variable=="s.a"]),
       color="s.mean",linetype="s.mean",y=.5),size = 5) +
  geom_point(aes(x=mean(sdata.m$value[sdata.m$variable=="s.b"]),
        color="s.mean",linetype="s.mean",y=.5),size = 5)+
  scale_color_manual(breaks=c("d.a","d.b","s.a","s.b","s.mean"),
                     values=c("blue","blue","red","red","green"))+
  scale_linetype_manual(breaks=c("d.a","d.b","s.a","s.b","s.mean"),
                     values=c(1,2,1,2,0))+
  guides(color=guide_legend(override.aes=list(shape=c(NA,NA,NA,NA,16))))


3
votes

Didzis gets credit for the answer; I was able to adapt his code and get to the final product I was looking for:

ggplot(sdata.m, aes(x=value, color=variable,linetype=variable,shape=variable))+
  stat_ecdf(size=1)+
  geom_vline(data=ddata.m,
             aes(xintercept = value,color=variable,linetype=variable,shape=variable),
             size=2) +
  geom_point(aes(x=mean(sdata.m$value[sdata.m$variable=="s.a"]),
                 color="s.a.mean",linetype="s.a.mean",shape="s.a.mean",
                 y=.5),size = 5) +
  geom_point(aes(x=mean(sdata.m$value[sdata.m$variable=="s.b"]),
                 color="s.b.mean",linetype="s.b.mean",shape="s.b.mean",
                 y=.5),size = 5) +
  scale_shape_manual(breaks=c("d.a","d.b","s.a","s.a.mean","s.b","s.b.mean"),
                     values=c(16,16,16,16,16,16)) +
  scale_color_manual(breaks=c("d.a","d.b","s.a","s.a.mean","s.b","s.b.mean"),
                     values=c("blue","red","blue","blue","red","red"))+
  scale_linetype_manual(breaks=c("d.a","d.b","s.a","s.a.mean","s.b","s.b.mean"),
                        values=c(2,2,1,0,1,0))+
  guides(color=guide_legend(override.aes=list(shape=c(NA,NA,NA,16,NA,16))))

A couple of things I learned:

  1. When supplying breaks/values to the scale_*_manual() functions, alphabetical order is important: in my case the values lined up with the levels of 'variable' in alphabetical order.
  2. When all aesthetics (linetype/shape/color) are mapped to the same thing ('variable' here), you can get everything in one legend.
  3. When setting things with manual scales, you need to make one scale per aesthetic, and then override legend details with guides() if need be.
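
One way to sidestep the ordering issue in point 1 is to pass named vectors as `values`, so that each level is matched by name rather than by position. This is only a minimal sketch, assuming the same s.a/s.b/d.a/d.b naming as above; the `cols` and `ltys` names, the fixed seed, and the omission of the mean points are choices made for illustration, not part of the original code:

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same shape as the melted data above: one 'variable' and one 'value' column
sdata.m <- data.frame(
  variable = rep(c("s.a", "s.b"), each = 100),
  value    = c(rnorm(100) * 100, rnorm(100) * 100 + 50)
)
ddata.m <- data.frame(variable = c("d.a", "d.b"), value = c(-35, 20))

# Named vectors: each value is tied to a level by name, so the order
# in which the entries are written no longer matters
cols <- c(d.a = "blue", s.a = "blue", d.b = "red", s.b = "red")
ltys <- c(s.a = "solid", s.b = "solid", d.a = "dashed", d.b = "dashed")

p <- ggplot(sdata.m, aes(x = value, color = variable, linetype = variable)) +
  stat_ecdf() +
  geom_vline(data = ddata.m,
             aes(xintercept = value, color = variable, linetype = variable)) +
  scale_color_manual(values = cols) +
  scale_linetype_manual(values = ltys)

print(p)
```

Because both scales are mapped to 'variable' and keep the default title, the color and linetype keys should still merge into a single legend, as in point 2.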

Thanks again Didzis. Another life, saved.
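
On the original question of whether the data needs a different format: one alternative is to split 'variable' into two columns and map color to the case (a/b) and linetype to the type (s/d). The following is a rough sketch under the assumption that every name follows the type.case pattern (s.a, d.b, and so on). The helper `split_name` and the columns `type`, `case`, and data frame `means` are hypothetical names introduced for this example:

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

sdata.m <- data.frame(
  variable = rep(c("s.a", "s.b"), each = 100),
  value    = c(rnorm(100) * 100, rnorm(100) * 100 + 50)
)
ddata.m <- data.frame(variable = c("d.a", "d.b"), value = c(-35, 20))

# Hypothetical helper: "s.a" -> type "s", case "a"
split_name <- function(df) {
  df$type <- sub("\\..*$", "", df$variable)
  df$case <- sub("^.*\\.", "", df$variable)
  df
}
sdata.m <- split_name(sdata.m)
ddata.m <- split_name(ddata.m)

# One mean per case, computed from the stochastic data only
means <- aggregate(value ~ case, data = sdata.m, FUN = mean)

p <- ggplot(sdata.m, aes(x = value, color = case)) +
  stat_ecdf(aes(linetype = type)) +
  geom_vline(data = ddata.m,
             aes(xintercept = value, color = case, linetype = type)) +
  geom_point(data = means, aes(y = 0.5), size = 5) +
  scale_color_manual(values = c(a = "blue", b = "red")) +
  scale_linetype_manual(values = c(s = "solid", d = "dashed"))

print(p)
```

This produces two separate legends (one for color, one for linetype) instead of a single merged one, which may or may not be preferable depending on how the plot will be read.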