I am using a histogram to visualize the distribution of prices of a beverage across stores, with a fill showing the proportion of stores in each status at each price level. Some statuses are missing from the fill, and some bins do not appear at all, even though I am fairly certain they are present in the data. I noticed the problem because I also display the average of another variable over each bin, and given the value of that label, the fill I end up with should not be possible.
The setup should be fairly simple: I use geom_histogram with x = price, map the fill to cpm.bin, and modify the x-axis scale. As mentioned above, I added a geom_text to display the average cpm for each bin. After noticing that something was wrong, I experimented with subsets of the data frame.
This is a small sample of the data frame I am using, but I believe it will be enough to demonstrate the problem.
library(lemon)
library(ggplot2)
df1 <- data.table::fread(
"id size price cpm.bin int.ave.cpm p.int
420 12ounce 2.39 Good 32.50 2.4
629 12ounce 2.78 Underperforming 18.00 2.8
940 12ounce 2.49 Non-purchasing 22.00 2.5
1653 12ounce 2.45 Good 22.00 2.5
1660 12ounce 2.45 Good 22.00 2.5
2561 20ounce 2.59 Underperforming 13.65 2.6
2578 20ounce 2.39 Underperforming 26.02 2.4
2580 20ounce 2.39 Underperforming 26.02 2.4
2581 20ounce 2.39 Good 26.02 2.4
2582 20ounce 2.39 Good 26.02 2.4
2583 20ounce 2.39 Good 26.02 2.4
2584 20ounce 2.39 Good 26.02 2.4
2587 20ounce 2.49 Non-purchasing 20.05 2.5
2589 20ounce 2.99 Underperforming 18.13 3.0
2599 20ounce 2.49 Non-purchasing 20.05 2.5
2600 20ounce 2.49 Underperforming 20.05 2.5
2606 20ounce 2.59 Non-purchasing 13.65 2.6
2607 20ounce 2.39 Good 26.02 2.4
2609 20ounce 2.39 Underperforming 26.02 2.4
2629 20ounce 2.49 Non-purchasing 20.05 2.5
"
)
df2 <- data.table::fread(
"id size price cpm.bin int.ave.cpm p.int
629 12ounce 2.78 Underperforming 18.00 2.8
940 12ounce 2.49 Non-purchasing 22.00 2.5
1653 12ounce 2.45 Good 22.00 2.5
1660 12ounce 2.45 Good 22.00 2.5
2561 20ounce 2.59 Underperforming 13.65 2.6
2587 20ounce 2.49 Non-purchasing 20.05 2.5
2589 20ounce 2.99 Underperforming 18.13 3.0
2599 20ounce 2.49 Non-purchasing 20.05 2.5
2600 20ounce 2.49 Underperforming 20.05 2.5
2606 20ounce 2.59 Non-purchasing 13.65 2.6
2629 20ounce 2.49 Non-purchasing 20.05 2.5
2634 20ounce 2.59 Non-purchasing 13.65 2.6
2658 20ounce 2.49 Underperforming 20.05 2.5
2665 20ounce 2.59 Non-purchasing 13.65 2.6
2671 20ounce 2.69 Non-purchasing 21.18 2.7
2673 20ounce 2.69 Good 21.18 2.7
2674 20ounce 2.69 Good 21.18 2.7
2675 20ounce 2.69 Underperforming 21.18 2.7
2676 20ounce 2.69 Good 21.18 2.7
2677 20ounce 2.69 Good 21.18 2.7"
)
When I use these data frames in the following ggplot call (shown here with df1; the second plot uses df2 instead), the $2.50 bin in the "12 Ounce" panel gets a different fill.
ggplot(df1, aes(x = price)) +
  geom_histogram(aes(fill = cpm.bin), binwidth = 0.1, position = position_fill(), stat = "bin") +
  facet_rep_wrap(~size, nrow = 3, repeat.tick.labels = TRUE, scales = "free") +
  scale_x_continuous(breaks = seq(0, 10, by = 0.1), labels = scales::dollar) +
  geom_text(aes(x = p.int, y = 0.5, label = int.ave.cpm), size = 4)
The only difference between these subsets is the minimum value of p.int: for df1 the minimum is 2.4, and for df2 the minimum is 2.5.
For the $2.50 bin in the "12 Ounce" panel, the fill should be 2/3 "Good" (blue) and 1/3 "Non-purchasing" (red), no matter what the minimum value of p.int is. What is going on, and how do I fix this so that my plots display the values accurately and proportionally when I use my entire data frame?
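To make the expected fill concrete, here is a minimal sketch that tabulates the proportions by hand. It assumes p.int is the intended bin label and re-enters the five "12ounce" rows of df1 so that it runs on its own; the names oz12 and expected are only for illustration.

```r
# The five "12ounce" rows of df1, re-entered by hand so the check is self-contained
oz12 <- data.frame(
  id      = c(420, 629, 940, 1653, 1660),
  price   = c(2.39, 2.78, 2.49, 2.45, 2.45),
  cpm.bin = c("Good", "Underperforming", "Non-purchasing", "Good", "Good"),
  p.int   = c(2.4, 2.8, 2.5, 2.5, 2.5)
)

# Share of each status within each p.int level (each row sums to 1)
expected <- prop.table(table(oz12$p.int, oz12$cpm.bin), margin = 1)
print(round(expected, 3))
```

The row for 2.5 is the fill I expect to see in the $2.50 bin of the "12 Ounce" panel.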
Thanks.
There seems to be a problem with the data.table::fread call that you supply to read your example data: it does not distinguish between '12' and '20' ounce; everything in the size column is just 'Ounce'. I corrected this with df1$size[1:5]=paste('12',df1$size[1:5]); df1$size[6:nrow(df1)]=paste('20',df1$size[6:nrow(df1)]) and when I then run your code I get a very different plot... Does this help you/could it be the source of your problem? – which_command