1
votes

I want to plot a ggplot2 boxplot using all columns of a data.frame, and I want to reorder the columns by the median for each column, rotate the x-axis labels, and fill each box with the colour corresponding to the same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour. (The reason I'm trying to do this is that the resultant plot will provide context for a force-directed network graph with nodes that will be colour-coded in the same way as the boxplot -- the colour will then provide a mapping between the two plots.) It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. So, for example, the box corresponding to the column variable with a high median value will have a colour that denotes this mapping and matches perfectly the colour for the same column variable in other plots (such as the corresponding node in a force-directed network graph).

So far, I have something like this:

# Melt the data.frame:
DT.m <- melt(results, id.vars = NULL) # using reshape2
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN = median), y = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  stat_summary(fun = mean, colour = "darkred", geom = "point") + # fun.y before ggplot2 3.3.0
  geom_boxplot(???, alpha = 0.5)

The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. Various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?

2
Can you make a reproducible example? Maybe something with a built-in data set? We can't run your code without your results data. – Gregor Thomas

2 Answers

7
votes

You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:

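library(ggplot2)
library(dplyr)

# Add the per-group median as a column, then map it to the fill aesthetic
ggplot(mtcars %>% group_by(carb) %>%
         mutate(medMPG = median(mpg)),
       aes(x = reorder(carb, mpg, FUN = median), y = mpg)) +
  geom_boxplot(aes(fill = medMPG)) +
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, colour = "darkred", geom = "point") +
  scale_fill_gradient(low = hcl(15, 100, 75), high = hcl(195, 100, 75))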


If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll need to also set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say across all my data sets, it varies from 13.3 to 39.8. Then I could add this to all my plots:

scale_fill_gradient(limits=c(13.3, 39.8), 
                    low=hcl(15,100,75), high=hcl(195,100,75))

This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
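
One way to set the limits programmatically, as a minimal sketch: compute the group medians in each data frame and take the overall range. The helper `group_median_range()` and the list `datasets` are hypothetical names, and the second data frame is just a subset of mtcars standing in for "another data set".

```r
# Hypothetical helper: range of the per-group medians within one data frame
group_median_range <- function(df, value, group) {
  range(tapply(df[[value]], df[[group]], median))
}

# Hypothetical collection of data sets that should share one colour scale
datasets <- list(all_cars = mtcars, manual_cars = subset(mtcars, am == 1))

# Smallest and largest group median across every data set
per_dataset <- lapply(datasets, group_median_range, value = "mpg", group = "carb")
fill_limits <- range(unlist(per_dataset))
```

Passing `limits = fill_limits` to `scale_fill_gradient()` in each plot should then give every plot the same value-to-colour mapping, and the limits update automatically if the data change.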

0
votes

I built on eipi10's solution and obtained the following code which does what I want:

library(reshape2)  # melt()
library(ggplot2)
library(dplyr)

# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL)
g <- ggplot(DT.m %>% group_by(variable) %>%
              mutate(median.gini = median(value)),
            aes(x = reorder(variable, value, FUN = median), y = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_boxplot(aes(fill = median.gini)) +
  stat_summary(fun = mean, colour = "darkred", geom = "point") +  # fun.y before ggplot2 3.3.0
  scale_fill_gradientn(colours = heat.colors(9)) +
  ylab("Gini impurity") +
  xlab("Feature") +
  guides(fill = guide_colourbar(title = "Median\nGini\nimpurity"))
plot(g)

Later, for the second plot:

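# One median per column, as a named numeric vector
medians <- sapply(results, median)
# Bin the medians into 1000 intervals and look up each bin's colour
# in a 1000-step version of the same heat.colors(9) palette
color <- colorRampPalette(colors = heat.colors(9))(1000)[cut(medians, 1000, labels = FALSE)]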

color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!
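
A caveat on the matching: `colorRampPalette()` interpolates in RGB by default, while `scale_fill_gradientn()` interpolates in Lab space by default, and `cut()` bins the values, so the two sets of colours may differ slightly. If an exact match matters, one option is to read the fill colours back from the built plot with `layer_data()`. The sketch below uses synthetic three-column data as a stand-in for `results` (an assumption, since the real data are not shown), and `node_colours` is a hypothetical name.

```r
library(ggplot2)

# Synthetic stand-in for the melted data (hypothetical values in [0, 1])
set.seed(1)
DT.m <- data.frame(
  variable = rep(c("a", "b", "c"), each = 50),
  value    = c(rbeta(50, 2, 5), rbeta(50, 5, 5), rbeta(50, 5, 2))
)
# Attach each column's median to its rows
DT.m$median.gini <- ave(DT.m$value, DT.m$variable, FUN = median)

g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN = median), y = value)) +
  geom_boxplot(aes(fill = median.gini)) +
  scale_fill_gradientn(colours = heat.colors(9))

# layer_data() returns the computed data for layer 1: one row per box,
# including the hex fill colour that ggplot2 assigned to it
boxes <- layer_data(g, 1)

# Boxes sit at x = 1, 2, 3, ... in the order of the reordered factor levels
levels_in_order <- levels(reorder(DT.m$variable, DT.m$value, FUN = median))
node_colours <- setNames(boxes$fill, levels_in_order[as.integer(boxes$x)])
```

`node_colours` is then a named character vector of hex colours, one per column, which can be looked up by column name when colouring the network nodes.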