Histogram fill not representing all present levels of factor in certain bins

Question

I am using a histogram to visualize the distribution of prices of a beverage across stores, with a fill that shows the proportion of stores in each status within each price level. Some levels are missing from the fill, and some bins are missing entirely, even though I am fairly certain they are present in the data. I noticed the problem because I also display the average of another variable over each bin, and given the value of that label, the fill I end up with shouldn't be possible.

The setup should be fairly simple: I use geom_histogram with x = price, assign the fill (cpm.bin), and modify the x-axis scale. As mentioned above, I added a geom_text to display the average cpm for each bin. When I noticed things were wrong, I experimented with subsets of the data frame.

This is a small sample of the data frame I am using, but I believe it will be enough to demonstrate the problem.

library(lemon)
library(ggplot2)

df1 <- data.table::fread(
  "id   size price         cpm.bin int.ave.cpm p.int
    420  12ounce  2.39            Good       32.50   2.4
    629  12ounce  2.78 Underperforming       18.00   2.8
    940  12ounce  2.49  Non-purchasing       22.00   2.5
    1653 12ounce  2.45            Good       22.00   2.5
    1660 12ounce  2.45            Good       22.00   2.5
    2561 20ounce  2.59 Underperforming       13.65   2.6
    2578 20ounce  2.39 Underperforming       26.02   2.4
    2580 20ounce  2.39 Underperforming       26.02   2.4
    2581 20ounce  2.39            Good       26.02   2.4
    2582 20ounce  2.39            Good       26.02   2.4
    2583 20ounce  2.39            Good       26.02   2.4
    2584 20ounce  2.39            Good       26.02   2.4
    2587 20ounce  2.49  Non-purchasing       20.05   2.5
    2589 20ounce  2.99 Underperforming       18.13   3.0
    2599 20ounce  2.49  Non-purchasing       20.05   2.5
    2600 20ounce  2.49 Underperforming       20.05   2.5
    2606 20ounce  2.59  Non-purchasing       13.65   2.6
    2607 20ounce  2.39            Good       26.02   2.4
    2609 20ounce  2.39 Underperforming       26.02   2.4
    2629 20ounce  2.49  Non-purchasing       20.05   2.5
  "
)

df2 <- data.table::fread(
  "id size price         cpm.bin int.ave.cpm p.int
  629  12ounce  2.78 Underperforming       18.00   2.8
  940  12ounce  2.49  Non-purchasing       22.00   2.5
  1653 12ounce  2.45            Good       22.00   2.5
  1660 12ounce  2.45            Good       22.00   2.5
  2561 20ounce  2.59 Underperforming       13.65   2.6
  2587 20ounce  2.49  Non-purchasing       20.05   2.5
  2589 20ounce  2.99 Underperforming       18.13   3.0
  2599 20ounce  2.49  Non-purchasing       20.05   2.5
  2600 20ounce  2.49 Underperforming       20.05   2.5
  2606 20ounce  2.59  Non-purchasing       13.65   2.6
  2629 20ounce  2.49  Non-purchasing       20.05   2.5
  2634 20ounce  2.59  Non-purchasing       13.65   2.6
  2658 20ounce  2.49 Underperforming       20.05   2.5
  2665 20ounce  2.59  Non-purchasing       13.65   2.6
  2671 20ounce  2.69  Non-purchasing       21.18   2.7
  2673 20ounce  2.69            Good       21.18   2.7
  2674 20ounce  2.69            Good       21.18   2.7
  2675 20ounce  2.69 Underperforming       21.18   2.7
  2676 20ounce  2.69            Good       21.18   2.7
  2677 20ounce  2.69            Good       21.18   2.7"
)

When I use these data frames in the following ggplot (shown here with df1), the fill for the $2.50 bin in the "12ounce" panel differs between the two.

ggplot(df1, aes(x = price)) +
  geom_histogram(aes(fill = cpm.bin), binwidth = 0.1, position = position_fill(), stat = "bin") +
  facet_rep_wrap(~size, nrow = 3, repeat.tick.labels = TRUE, scales = "free") +
  scale_x_continuous(breaks = seq(0, 10, by = 0.1), labels = scales::dollar) +
  geom_text(aes(x = p.int, y = 0.5, label=int.ave.cpm), size=4)

The only difference between these subsets is the minimum value of p.int: for df1 the minimum is 2.4, and for df2 it is 2.5.

For the $2.50 bin in the "12ounce" panel, the fill should be 2/3 "Good" (blue) and 1/3 "Non-purchasing" (red), no matter what the minimum value of p.int is. What is going on, and how do I fix this so that my plots display values accurately and proportionally when I use my entire data frame?
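The expected shares can be confirmed without any plotting. This is a minimal sketch, assuming only the five 12ounce rows of df1 and treating the p.int column as the bin label; the name d12 is made up for illustration.

```r
# Five "12ounce" rows copied from df1 above (d12 is a hypothetical name)
d12 <- data.frame(
  p.int   = c(2.4, 2.8, 2.5, 2.5, 2.5),
  cpm.bin = c("Good", "Underperforming", "Non-purchasing", "Good", "Good")
)

# Count stores per bin label and status, then convert each row to shares
counts <- table(d12$p.int, d12$cpm.bin)
shares <- prop.table(counts, margin = 1)

# Row "2.5" holds the shares the $2.50 bar should display
round(shares, 2)
```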

Thanks.

I suggested some edits to include your data as code and to load the required library (lemon). Can you please include a picture of your plot pointing out the change you would like to see? - yake84
Just to clarify something: when I use the data.table::fread call you supply to read your example data, it does not distinguish between '12' and '20' ounce; everything in the size column is just 'Ounce'. I corrected this: df1$size[1:5]=paste('12',df1$size[1:5]); df1$size[6:nrow(df1)]=paste('20',df1$size[6:nrow(df1)]). Then when I run your code I get a very different plot... Does this help you, or could it be the source of your problem? - which_command
@which_command all the levels for price in my data frame are good; there must have been an issue when I copied and pasted the data frame. - Che Diaz
@yake84, I am pretty new to SO, and I am still trying to figure out how to upload an image from my device. I will include a pic ASAP. - Che Diaz
@which_command I fixed the size variable so the levels are now one word (e.g. "12ounce" instead of "12 Ounce"). - Che Diaz

yake84 · Accepted Answer · 2019-05-31T15:38:17

I would use geom_bar() instead of geom_histogram() for this. I have modified the code to show how I would make the plot. You can still use the lemon package to do all the neat labeling, etc.

# plot_1 <-
ggplot(df1) +
  geom_bar(aes(x = round(price, 1), fill = cpm.bin), width = 0.09) +
  facet_grid(~size) + 
  scale_fill_manual(values = c("dodgerblue2", "coral1", "mediumseagreen")) +
  xlim(2.3, 3.1) + ylim(0, 10) +
  #theme(legend.position = "none") +
  labs(title = "df1")

# plot_2 <-
ggplot(df2) +
  geom_bar(aes(x = round(price, 1), fill = cpm.bin), width = 0.09) +
  facet_grid(~size) +
  scale_fill_manual(values = c("dodgerblue2", "coral1", "mediumseagreen")) +
  xlim(2.3, 3.1) + ylim(0, 10) +
  labs(title = "df2")

# gridExtra::grid.arrange(plot_1, plot_2, nrow = 2)
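
If the goal is the proportional fill from the question rather than raw counts, the same idea (bin by a rounded price instead of letting the histogram choose its own breaks) can be combined with position = "fill". This is only a sketch and was not tested in the thread: it assumes the existing p.int column is the intended bin label, uses just the five 12ounce rows of df1, and the name d12 is hypothetical. A plausible but unconfirmed reason the histogram misbehaves is that 2.45 lies on or right next to a bin edge when binwidth = 0.1, so the bin it falls into can change with the range of the data in the panel.

```r
library(ggplot2)

# Five "12ounce" rows copied from df1 (d12 is a hypothetical name)
d12 <- data.frame(
  p.int   = c(2.4, 2.8, 2.5, 2.5, 2.5),
  cpm.bin = c("Good", "Underperforming", "Non-purchasing", "Good", "Good")
)

# factor(p.int) makes each rounded price its own discrete bar, so no
# histogram breaks are involved; position = "fill" scales every bar to 1
p <- ggplot(d12, aes(x = factor(p.int), fill = cpm.bin)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "price (rounded to $0.10)", y = "share of stores")

print(p)
```

With the full data frame, adding facet_wrap(~size, scales = "free_x") to this plot should give one panel per size.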
