I’m new to R, but learning all I can. I get the warning below when plotting a faceted histogram.

Warning message: Removed ### rows containing missing values (geom_bar).

I’ve read that the warning may be caused by x-axis limits that cut off some data points; however, after investigating, that doesn’t appear to be the case.

The problem seems to be the position = "fill" argument of geom_histogram. When position = "fill" is removed, no warning is produced.

Any help or advice is greatly appreciated.

library(ggplot2)

df <- data.frame(recividism = sample(0:2, 100, replace = TRUE),
                 TotalDays = sample(15:1000, 100, replace = TRUE),
                 NumEnroll = sample(1:7, 100, replace = TRUE))

df$recividism <- as.factor(df$recividism)

levels(df$recividism) <- c("Not Perm", "Perm", "Rec")

ggplot(data = df, aes(x = TotalDays)) + 
  geom_histogram(aes(fill = recividism), position = "fill" ) + 
  facet_grid(facets = NumEnroll ~ .)

1 Answer

It's a bit confusing, I agree. Clearly, there is no real missing data. And as you said, if you stack instead of fill, all is fine.

What is happening is that ggplot internally makes a table with one row for each combination of x bin, group (fill in this case), and facet. In the case of stack, the combinations without any values are silently not plotted, because they stack up to a height of zero. But fill divides each height by the bin total, and when the total is 0 that division is 0/0, which is undefined (NaN in R, not Inf), so y ends up as NA.
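To make the arithmetic concrete, here is a minimal base R sketch of what stack and fill do to the counts in a single bin. The count vectors are made-up illustrative values, and the cumsum-based calculation is only an approximation of the idea, not ggplot2's actual internal code.

```r
# Hypothetical counts per fill group within one bin of one facet panel
counts_filled <- c(1, 0, 0)  # a bin containing one observation
counts_empty  <- c(0, 0, 0)  # a bin containing no observations at all

# position = "stack": heights are cumulative counts; an empty bin stacks to 0
stack_filled <- cumsum(counts_filled)
stack_empty  <- cumsum(counts_empty)

# position = "fill": the stacked heights are divided by the bin total
fill_filled <- cumsum(counts_filled) / sum(counts_filled)
fill_empty  <- cumsum(counts_empty) / sum(counts_empty)  # 0 / 0

print(stack_empty)        # all zeros, nothing to draw, no warning
print(fill_filled)        # proportions between 0 and 1
print(fill_empty)         # NaN for every group
print(is.na(fill_empty))  # is.na() treats NaN as missing
```

The empty bin is harmless when stacking, but once it is divided by its own zero total every group in it becomes a missing value, and those are the rows the warning counts.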

You can see this in the data ggplot builds for the plot:

p <- ggplot(data = df, aes(x = TotalDays)) + 
  geom_histogram(aes(fill = recividism), position = "fill" ) + 
  facet_grid(facets = NumEnroll ~ .)

head(ggplot_build(p)$data[[1]], 10)

This shows:

      fill  y count         x      xmin      xmax     density ncount ndensity PANEL group ymin ymax colour size linetype alpha
1  #619CFF NA     0  33.86207  16.93103  50.79310 0.000000000      0   0.0000     1     3   NA   NA     NA  0.5        1    NA
2  #00BA38 NA     0  33.86207  16.93103  50.79310 0.000000000      0   0.0000     1     2   NA   NA     NA  0.5        1    NA
3  #F8766D NA     0  33.86207  16.93103  50.79310 0.000000000      0   0.0000     1     1   NA   NA     NA  0.5        1    NA
4  #619CFF NA     0  67.72414  50.79310  84.65517 0.000000000      0   0.0000     1     3   NA   NA     NA  0.5        1    NA
5  #00BA38 NA     0  67.72414  50.79310  84.65517 0.000000000      0   0.0000     1     2   NA   NA     NA  0.5        1    NA
6  #F8766D NA     0  67.72414  50.79310  84.65517 0.000000000      0   0.0000     1     1   NA   NA     NA  0.5        1    NA
7  #619CFF  1     1 101.58621  84.65517 118.51724 0.007382892      1 135.4483     1     3    0    1     NA  0.5        1    NA
8  #00BA38  1     0 101.58621  84.65517 118.51724 0.000000000      0   0.0000     1     2    1    1     NA  0.5        1    NA
9  #F8766D  1     0 101.58621  84.65517 118.51724 0.000000000      0   0.0000     1     1    1    1     NA  0.5        1    NA
10 #619CFF NA     0 135.44828 118.51724 152.37931 0.000000000      0   0.0000     1     3   NA   NA     NA  0.5        1    NA

You can see there are NAs for y in many rows, and those rows are what the warning refers to.
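If you want to check this yourself, a sketch along these lines counts the affected rows in the built data. The seed and bins = 30 are my own assumptions (the seed only makes the sketch repeatable, and 30 is the default bin count, set explicitly to avoid the binning message). The exact number, and whether missing y shows up as NA or NaN, may vary with your data and ggplot2 version.

```r
library(ggplot2)

set.seed(42)  # assumption: arbitrary seed, only so the sketch is repeatable
df <- data.frame(
  recividism = factor(sample(0:2, 100, replace = TRUE),
                      levels = 0:2, labels = c("Not Perm", "Perm", "Rec")),
  TotalDays = sample(15:1000, 100, replace = TRUE),
  NumEnroll = sample(1:7, 100, replace = TRUE)
)

p <- ggplot(df, aes(x = TotalDays)) +
  geom_histogram(aes(fill = recividism), position = "fill", bins = 30) +
  facet_grid(NumEnroll ~ .)

# One row per bin x fill group x facet panel
built <- ggplot_build(p)$data[[1]]

# Rows whose y could not be computed (is.na() is TRUE for both NA and NaN)
na_rows <- is.na(built$y)
n_missing <- sum(na_rows)
print(n_missing)

# Every such row should belong to a group with zero observations in that bin
print(all(built$count[na_rows] == 0))
```

The count printed by n_missing should correspond to the number of removed rows reported in the warning, since each of those rows is one group in a bin that is empty within its facet.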

Hope that clears things up a bit. So it does indeed happen because you have no data in some ranges of x, but that's not the full story: the NAs only appear because fill divides by the zero total of those empty bins.