0
votes

I have data of the form

cvar1  cvar2  numvar
a      x      0.1
a      y      0.2
b      x      0.15
b      y      0.25

That is, two categorical variables, and one numerical variable.

Using ggplot2, I can get a nice scatter plot that shows the data for each combination of cvar1 and cvar2 by doing qplot(y=numvar, x=interaction(cvar1, cvar2)). This gives me several columns of points like this:

[image: scatter plot with one column of points per combination of cvar1 and cvar2]

To each of these columns I would like to add a small horizontal line representing the mean of the data points in that column. And a similar small horizontal line for the mean + sd and the mean - sd. (Kind of a bastardized box plot, but with all points visible and using mean and sd rather than median and IQR.) Thanks in advance!

1
see abline in base, e.g. - MichaelChirico
I don't see how abline, at least as I have used it, would help, because I have multiple columns I need to add lines to. Also, does abline even work with ggplot2? - Ben S.
gotcha. Then the base function you want is segments; I don't know what the equivalent is in ggplot2, hence leaving this as a comment - MichaelChirico
@BenS. There is the function geom_abline in ggplot2. You may also want to look at geom_vline and geom_hline as well. - steveb

1 Answer

3
votes

You can create a table that contains the mean and mean +/- sd for each group of points. Then you can plot lines using geom_segment().

First, I create some sample data:

set.seed(1245)
data <- data.frame(cvar1 = rep(letters[1:2], each = 12),
                   cvar2 = rep(letters[25:26], times = 12),
                   numvar = runif(2*12))

This creates the table with the values that you need using dplyr and tidyr:

library(dplyr)
library(tidyr)
summ <- group_by(data, cvar1, cvar2) %>%
        summarise(mean = mean(numvar),
                  low = mean - sd(numvar),
                  high = mean + sd(numvar)) %>%
        gather(variable, value, mean:high)

The three steps do the following: first, the data is split into groups by cvar1 and cvar2; then the three required values are calculated for each group; finally, the result is converted to long format, which is what ggplot() needs. (Maybe you are more familiar with melt(), which does basically the same thing as gather().)
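In current tidyr releases, gather() is documented as superseded by pivot_longer(). As a minimal sketch, assuming dplyr 1.0.0 or later (for the .groups argument) and tidyr 1.0.0 or later, the same summary table could be built like this. The name summ_long is only illustrative and is not used elsewhere in this answer:

```r
library(dplyr)
library(tidyr)

# Same sample data as above
set.seed(1245)
data <- data.frame(cvar1 = rep(letters[1:2], each = 12),
                   cvar2 = rep(letters[25:26], times = 12),
                   numvar = runif(2 * 12))

summ_long <- data %>%
  group_by(cvar1, cvar2) %>%
  # one row per group with the mean and mean +/- sd
  summarise(mean = mean(numvar),
            low  = mean - sd(numvar),
            high = mean + sd(numvar),
            .groups = "drop") %>%
  # long format: one row per group and statistic
  pivot_longer(cols = mean:high,
               names_to = "variable",
               values_to = "value")
```

The result has one row per group and statistic, with the same variable and value columns that the plotting code expects.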

And finally, this creates the plot:

library(ggplot2)
ggplot(data) + geom_point(aes(x = interaction(cvar1, cvar2), y = numvar)) +
  # one horizontal segment per group and statistic, centred on the group's column
  geom_segment(data = summ,
               aes(x = as.numeric(interaction(cvar1, cvar2)) - .5,
                   xend = as.numeric(interaction(cvar1, cvar2)) + .5,
                   y = value, yend = value, colour = variable))

[image: the resulting plot, with coloured horizontal segments for the mean and mean +/- sd drawn over each column of points]

You probably won't want the colours. I just added them to make the example clearer.

geom_segment() needs the start and end coordinates of each line to be specified. Because interaction(cvar1, cvar2) is a factor, it needs to be converted to numeric before it is possible to do arithmetic with it. I subtracted 0.5 from and added 0.5 to the numeric value of interaction(cvar1, cvar2), which makes the lines quite wide. Choosing a smaller value will make the lines shorter.
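To make the conversion concrete, here is a small base R sketch of what as.numeric() does to the interaction factor and how the segment end points follow from it. The toy vectors and the half_width value of 0.2 are assumptions chosen for illustration, not part of the code above:

```r
# Hypothetical toy data: the four combinations from the question
cvar1 <- c("a", "a", "b", "b")
cvar2 <- c("x", "y", "x", "y")

grp <- interaction(cvar1, cvar2)  # factor with one level per combination
pos <- as.numeric(grp)            # integer level codes, i.e. positions on the discrete x axis

half_width <- 0.2                 # assumed value; smaller values give shorter lines

# start and end x coordinate of the segment for each combination
segs <- data.frame(group = grp,
                   x     = pos - half_width,
                   xend  = pos + half_width)
print(segs)
```

Note that the level codes follow the factor's level order (by default the first variable varies fastest), not the row order of the data, which is why the same interaction() call is used for both the points and the segments.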