I want to label outliers in a ggplot box plot with the name of the subject for which outlying data were observed.

I have proceeded by creating a simple function to identify outliers:

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

I then added the safe.ifelse workaround to get ifelse to work properly with factors:

safe.ifelse <- function(cond, yes, no) {
  class.y <- class(yes)
  if (class.y == "factor") {
    levels.y = levels(yes)
  }
  X <- ifelse(cond,yes,no)
  if (class.y == "factor") {
    X = as.factor(X)
    levels(X) = levels.y
  } else {
    class(X) <- class.y
  }
  return(X)
}

From here, I have run the data (available at https://www.dropbox.com/s/2pcuuclxiqw1va1/data.csv?dl=0) through a dplyr pipeline to produce the plot:

library(dplyr)
library(ggplot2)

# Drop rows where variable1 is missing
data <- subset(data, !is.na(variable1))

p1<-
  data %>%
  group_by(season,location) %>%
  mutate(outlier=safe.ifelse(is_outlier(variable1),subject,as.numeric(NA))) %>%
  ggplot(aes(x=factor(season),y=variable1))+
  geom_boxplot()+         
  facet_wrap(~location,nrow=2)+
  guides(fill=FALSE)+
  geom_text(aes(label=outlier),na.rm=TRUE,hjust=1.5,size=2.5)

While outliers are correctly identified, the labeling does not work as it should. Rather than getting subject-specific outlier labels, three levels of the subject factor are printed repeatedly and erroneously (and seemingly at random). Labeling outliers by their numerical values (i.e. by changing subject to variable1 in the safe.ifelse call) does not cause problems.

I assume I am missing something obvious here - perhaps someone could kindly indicate where I am going wrong?

Thanks, Andreas

1 Answer

You need to subset your data in geom_text so that labels are drawn only for the outliers, and take the label text directly from subject:

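# Flag the outliers first and store the result, so the plot can filter on it
data <- data %>%
  group_by(season, location) %>%
  mutate(outlier = safe.ifelse(is_outlier(variable1), subject, as.numeric(NA)))

p1 <- data %>%
  ggplot(aes(x = factor(season), y = variable1)) +
  geom_boxplot() +
  facet_wrap(~location, nrow = 2) +
  guides(fill = FALSE) +
  # Draw text only for the flagged rows, labeled by subject
  geom_text(data = data[!is.na(data$outlier), ], aes(label = subject), hjust = 1.5, size = 2.5)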
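
As for why the original labels come out wrong: as far as I can tell, the problem is inside safe.ifelse itself. ifelse() returns the integer codes of a factor rather than its labels, as.factor() then builds fresh levels from only the codes that are present, and assigning levels(subject) over them matches by position, so the labels collapse onto the first few levels. The sketch below reproduces this in base R on made-up data (the subject names and the flagged vector are hypothetical, not taken from the linked CSV) and shows one possible workaround, converting to character first.

```r
# Hypothetical toy data; the names are invented for illustration only.
subject <- factor(c("anna", "ben", "carl", "dora", "emil"))
flagged <- c(FALSE, FALSE, FALSE, TRUE, TRUE)   # pretend the last two are outliers

# ifelse() drops the factor attributes and hands back the integer codes (4 and 5 here).
codes <- ifelse(flagged, subject, NA)

# as.factor() creates two new levels ("4", "5"); overwriting the levels by position
# then maps them onto the first two entries of levels(subject).
wrong <- as.factor(codes)
levels(wrong) <- levels(subject)

# Converting to character before ifelse() keeps each name on its own row.
right <- ifelse(flagged, as.character(subject), NA_character_)

print(as.character(wrong))  # NA NA NA "anna" "ben"
print(right)                # NA NA NA "dora" "emil"
```

In the pipeline from the question, this would amount to something like mutate(outlier = ifelse(is_outlier(variable1), as.character(subject), NA_character_)), after which geom_text(aes(label = outlier), na.rm = TRUE) should pick up the right names without the safe.ifelse helper. Subsetting and using label = subject, as above, avoids the issue as well because it never reads the mislabeled column.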