Is there a way in R's ggplot2 to incorporate the stat summary argument into the legend with geom_line?

Question

I have a dataset with SUD treatment rates for each state for each year from 2008 to 2017. Two states received an intervention and the rest did not. I am able to plot the treatment rates for each intervention state as a separate line and the non-intervention states (grouped as one line using the mean) on the same graph.

I would like to include the grouped line in the legend, but am having trouble doing so. So far only the intervention states are labeled. I've pasted a subset of the data (not exactly what the plot shows, but provides insight into the structure) and provided the ggplot code with the resulting figure. Any help would be appreciated. Thank you!

vars1 <- structure(list(statename = c("Alabama", "Alabama", "Alabama", 
"Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", 
"Alabama", "Colorado", "Colorado", "Colorado", "Colorado", "Colorado", 
"Colorado", "Colorado", "Colorado", "Colorado", "Colorado", "Iowa", 
"Iowa", "Iowa", "Iowa", "Iowa", "Iowa", "Iowa", "Iowa", "Iowa", 
"Iowa", "Washington", "Washington", "Washington", "Washington", 
"Washington", "Washington", "Washington", "Washington", "Washington", 
"Washington"), YEAR = c(2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2008L, 2009L, 2010L, 2011L, 2012L, 
2013L, 2014L, 2015L, 2016L, 2017L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2008L, 2009L, 2010L, 
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L), RML_ever_state = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L), TotalAdolescent_noprior_Admissionrate = c(4.07959747971533, 
17.1095315089946, 28.605586683907, 11.488423601983, 14.9233623164375, 
13.0325736612464, 11.5686458431041, 15.2887154594351, 18.9275946907701, 
17.9164783752891, 28.9653195613967, 30.542173819128, 33.8611241088185, 
29.9657748758525, 25.214146698236, 22.847991066509, 21.975843495247, 
21.783383749025, 25.8868468603421, 23.6029880132029, 45.9863203727017, 
51.6710909784629, 61.713238062499, 48.3292305542656, 52.4339028740025, 
54.9353761700907, 50.2282357945714, 46.8358255273624, 63.8611763171622, 
63.7710085311979, 50.5294974023959, 53.0407358054702, 54.7486826331652, 
59.981622172154, 57.2193036593259, 54.5478493207391, 49.924220486418, 
41.9396870928129, 11.3163228651143, 0.258492875751707)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 31L, 32L, 33L, 34L, 35L, 
36L, 37L, 38L, 39L, 40L, 117L, 118L, 119L, 120L, 121L, 122L, 
123L, 124L, 125L, 126L, 387L, 388L, 389L, 390L, 391L, 392L, 393L, 
394L, 395L, 396L), class = "data.frame")

ggplot(vars1, aes(x = YEAR, y = TotalAdolescent_noprior_Admissionrate, group = statename))  +
  stat_summary(
    data = ~ subset(., RML_ever_state == 0),
    aes(group = -1),
    fun.data = mean_se,
    geom = "line", size = 3, colour = "green"
  ) +
  geom_line(
    data = ~ subset(., RML_ever_state == 1),
    aes(colour = statename)
  )   + scale_x_discrete( limits=c(2008,2009,2010,2011,2012,2013,2014,2015,2016,2017))

Since the legend is showing the state, one thought is to view the summary of non-intervention states as a separate state (perhaps at this point call the column something different), then you will have all 50 states + the non-intervention summary; this would include it in the same legend. — steveb
Also, it is best to provide a small reproducible example (you are close to it here). You can create a minimal data set and cut / paste the output of your vars1 using dput(vars1) into the post. People will be able to reproduce by just cutting and pasting. You are likely to get answers in a few minutes vs. not at all or hours / days. However, the link I provided above may answer your question. — steveb
That will help others help you. I don't know how to do it using your approach. I have typically gotten around this by creating a new column that holds both the original and the summary data, with a new factor to tell them apart. All items end up in the legend, but it will have a different title because the summary data is included alongside the non-summarized data. I am not saying it is the best way, just what I have done to view data like this. — steveb
Thank you. I am very new to R, and navigating ggplot has been trickier than I thought it would be. I appreciate your help. I learned about dput() thanks to you. — Tom
I found a manual solution with annotate(): adding the following at the end of the ggplot call lets me control where the label goes. Not the best solution, but a workaround: + annotate('text', x=2016, y=15, label = "Non-RML \n state") — Tom

chemdork123 · Accepted Answer · 2021-03-21T18:18:52

A nice solution for this can be realized by understanding how ggplot2 creates legends. We will utilize that to our advantage to force a legend out of stat_summary(). Here, we're going to effectively "hijack" another aesthetic, and then squish the legends together to make it look like it's all one legend.

There are two points to go over to help explain why the final answer is working.

Legend creation inside `aes()`

First, ggplot2 will create a legend for anything that needs to be discriminated when inside aes(). So, if you map an aesthetic like color, fill, or size inside aes(), you start the process of having ggplot2 make a legend. That process is overridden when you also set the same aesthetic outside aes().

So this code generates a legend:

ggplot(df, aes(x,y)) + geom_point(aes(size=my_category))

But this one does not:

ggplot(df, aes(x,y)) + geom_point(aes(size=my_category), size=3)

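So, if you want to control the sizes in the first example above, you either accept whatever ggplot2 picks by default or specify them yourself by adding a scale_size_manual() call. The values argument needs one entry per level of my_category, so the single value below is enough only when there is one level: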

ggplot(df, aes(x,y)) + geom_point(aes(size=my_category)) + scale_size_manual(values=3)
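To make the difference between mapping and setting concrete, here is a minimal sketch on toy data. The data frame and the names x, y, and my_category are placeholders, not anything from the question. It uses layer_data() to look at the sizes ggplot2 actually computes for each version.

```r
library(ggplot2)

# Toy data; x, y, and my_category are placeholder names.
df <- data.frame(
  x = 1:6,
  y = c(2, 4, 3, 6, 5, 7),
  my_category = rep(c("a", "b"), times = 3)
)

# Mapped inside aes(): size varies by category and a legend is drawn.
# One manual value is supplied per level of my_category.
p_mapped <- ggplot(df, aes(x, y)) +
  geom_point(aes(size = my_category)) +
  scale_size_manual(values = c(a = 2, b = 5))

# Also set outside aes(): the constant wins, so every point gets
# size 3 and no size legend is produced.
p_fixed <- ggplot(df, aes(x, y)) +
  geom_point(aes(size = my_category), size = 3)

# layer_data() returns the computed aesthetics for the first layer.
sizes_mapped <- unique(layer_data(p_mapped)$size)
sizes_fixed <- unique(layer_data(p_fixed)$size)
```

Comparing sizes_mapped and sizes_fixed shows that the mapped version carries one size per category, while the fixed version collapses to the single constant.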

Quoted stuff vs. column names inside `aes()`

You can also control the names of the items in your legend by what you send to aes(). Take the above example. If I put aes(size=my_category), ggplot2 assumes that df has a column called my_category. If that column is a factor, its levels become the legend item names, each with its own size in the final plot. If the column is a character vector, it is effectively coerced to a factor and the items are listed in alphabetical order. If df$my_category is a numeric vector, you get a continuous size scale. The game changes a bit when you supply a quoted string instead of a column name. Let's assume df$my_category is a character vector. When I send it this:

ggplot(df, aes(x,y)) + geom_point(aes(size="my_category"))

I'll get every point at the same size and a legend with a single item called "my_category". Interesting. A quoted string is a constant, not a column lookup: it behaves as if you had supplied a column in which every row holds the value "my_category". Make sense? We can put this logic together to form your solution.
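A small sketch of the quoted-string behaviour, again on placeholder data (df, x, y, and my_category are illustrative names). The quoted string creates one legend level, so a single manual size value is enough here.

```r
library(ggplot2)

# Toy data; the column names are placeholders.
df <- data.frame(
  x = 1:4,
  y = c(3, 1, 4, 2),
  my_category = c("a", "b", "a", "b")
)

# The quoted string is a constant, so ggplot2 sees one discrete
# level and the legend gets a single key labelled "my_category".
p_const <- ggplot(df, aes(x, y)) +
  geom_point(aes(size = "my_category")) +
  scale_size_manual(values = 3)

# Every point ends up with the same computed size.
sizes_const <- unique(layer_data(p_const)$size)
```

This is the same trick the solution relies on: a constant string inside aes() gives you a legend entry with whatever label you choose.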

A Solution to Your Question

Now, I'm going to use this information to create your plot. To separate out that stat_summary() into a legend item, I need to pick an aesthetic. You could just specify that color= is inside aes() for stat_summary and use a character to label it correctly. That will work; however, if you do that all the legend keys (called "glyphs") will look the same as the line in stat_summary. That's fine if all the lines look the same, but your summary line is "fatter", so you want it to be different than the rest.

Here, I'll just create a legend using size, which will be a different legend from the color one used by geom_line(), and they can have their different formats without any issue.

First, since I'm pulling size into aes() for stat_summary, I need to remove size outside of aes() in stat_summary and put that value inside a separate scale_size_manual() command. Second, I use guides() to control the order of the legends and their titles. Finally, I use theme elements to control the spacing between the two legends and push them a bit closer together. The final result:

ggplot(vars1, aes(x = YEAR, y = TotalAdolescent_noprior_Admissionrate, group = statename))  +
  stat_summary(
    data = ~ subset(., RML_ever_state == 0),
    aes(group = -1, size='Summary'),
    fun.data = mean_se,
    geom = "line", colour = "green"
  ) +
  geom_line(
    data = ~ subset(., RML_ever_state == 1),
    aes(colour = statename)
  )   +
  scale_x_discrete( limits=c(2008,2009,2010,2011,2012,2013,2014,2015,2016,2017)) +
  scale_size_manual(values=3) +
  guides(
    color=guide_legend(title='Legend', order=1),
    size=guide_legend(title=NULL, order=2)
  ) +
  theme(
    legend.margin = margin(t=0,b=0),
    legend.spacing.y = unit(0,'pt'),
    legend.title = element_text(margin=margin(b=10))
  )
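For comparison, the approach steveb describes in the comments (treating the summary as its own "state") might look roughly like the sketch below. This is not the method used above, only an illustration of the idea. The toy vars1 is a made-up stand-in with the same column names as the question's data, and the label "Non-RML mean" is a hypothetical name. The summary line is not drawn thicker here, since everything shares one colour legend.

```r
library(ggplot2)

# Toy stand-in for the question's vars1 (values are made up).
vars1 <- data.frame(
  statename = rep(c("A", "B", "C"), each = 2),
  YEAR = rep(2008:2009, times = 3),
  RML_ever_state = rep(c(0L, 0L, 1L), each = 2),
  TotalAdolescent_noprior_Admissionrate = c(10, 20, 30, 40, 50, 60)
)

# Average the non-intervention states within each year.
summary_rows <- aggregate(
  TotalAdolescent_noprior_Admissionrate ~ YEAR,
  data = subset(vars1, RML_ever_state == 0),
  FUN = mean
)
summary_rows$statename <- "Non-RML mean"  # hypothetical label
summary_rows$RML_ever_state <- 0L

# Stack the intervention states and the summary "state" together.
plot_data <- rbind(
  subset(vars1, RML_ever_state == 1),
  summary_rows[, names(vars1)]
)
row.names(plot_data) <- NULL

# One colour mapping, so every line lands in the same legend.
p <- ggplot(plot_data,
            aes(x = YEAR, y = TotalAdolescent_noprior_Admissionrate,
                colour = statename)) +
  geom_line() +
  scale_x_continuous(breaks = unique(plot_data$YEAR)) +
  labs(colour = "Legend")
```

The trade-off is that the summary is computed before plotting instead of by stat_summary(), which keeps the legend simple but moves the aggregation out of ggplot2.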
