1
votes

I can easily make a stacked histogram in ggplot2 with counts on the y-axis. I want to convert this plot to a density scale. Adding aes(y=..density..) to the geom_histogram layer does this, but ggplot2 then normalizes each data series separately, so that each one has a total area of 1. So if you stack 4 data series in one histogram, the total area of the bars is 4.

What I am after is a stacked density histogram in which all of the data series are pooled when the density is calculated. In other words, I want the stacked density histogram to have bars in the same proportions as the counts histogram.
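To make the issue concrete, here is a minimal sketch that measures the total bar area of a per-group density histogram. It uses the built-in iris data (3 species instead of 4 series) and an arbitrary binwidth of 0.25, both of which are assumptions for illustration only. `after_stat(density)` is the newer spelling of `..density..`, and `layer_data()` returns the values ggplot2 computed for the layer.

```r
library(ggplot2)

# Per-group density histogram: each Species is normalized on its own
p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25)

# Pull the computed bins and add up density * bin width over all groups
d <- layer_data(p)
total_area <- sum(d$density * (d$xmax - d$xmin))
total_area  # roughly the number of groups (3 here), not 1
```

The goal is a version of this plot where the same sum comes out as 1.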

3
Help us help you by providing a reproducible example. – eipi10

3 Answers

1
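votes

    library(ggplot2)

    set.seed(1)  # seed added so the random example is reproducible
    dtDataset <- data.frame(
       V1 = rep(c('a', 'b'), times = 10),  # two groups, alternating over 20 rows
       V2 = runif(20)
    )

    # Stacked kernel density estimates, one curve per group in V1.
    # Note: each group's density is still normalized separately (area 1 per group).
    ggplot(dtDataset) + 
       geom_density(aes(x = V2, group = V1), position = 'stack')

1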
votes

I found a way to do this: compute a bin width, bw say, and set the y aesthetic to (..count..)/(n*bw), where n is the total number of data points across all classes.

Generate some toy data

    library(ggplot2)

    set.seed(1234)
    x1 <- rnorm(10000, 0, 1)
    x2 <- rnorm(90000, 1, 1)
    X <- data.frame(x = c(x1, x2), 
                    Class = as.factor(c(rep(1, length(x1)), rep(2, length(x2)))))

Calculate n and binwidth

    n <- nrow(X)                        ## total number of points, all classes pooled
    bw <- 3.49 * sd(X$x) * n^(-1/3)     ## binwidth using Scott's rule.

Generate the plot

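    ## after_stat(count) is the current spelling of the older ..count.. notation
    ## (ggplot2 >= 3.3.0). n and bw are looked up in the calling environment,
    ## so they do not need to be passed through aes().
    p1 <- ggplot(data = X, aes(x = x)) + 
            geom_histogram(aes(y = after_stat(count / (n * bw)), fill = Class), 
                           binwidth = bw) + 
            geom_density()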

    print(p1)

Now each bar is split by color according to the share of its points that belong to each class, and the stacked bars have a total area of 1, so the histogram is on the same scale as the pooled density estimate drawn as the black line.
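As a sanity check on that claim, the sketch below rebuilds a smaller version of the plot and sums the area of the stacked bar segments. The sample sizes (1000 and 9000) and the names `X_small`, `n_small`, `bw_small` and `bars` are assumptions chosen for this illustration. Each bar has height count/(n*bw) and width bw, so the areas add up to sum(count)/n.

```r
library(ggplot2)

set.seed(1234)
# Smaller toy data than in the answer, same structure (illustrative sizes)
X_small <- data.frame(
  x = c(rnorm(1000, 0, 1), rnorm(9000, 1, 1)),
  Class = factor(rep(c(1, 2), times = c(1000, 9000)))
)

n_small <- nrow(X_small)
bw_small <- 3.49 * sd(X_small$x) * n_small^(-1/3)  # Scott's rule

p <- ggplot(X_small, aes(x = x)) +
  geom_histogram(aes(y = after_stat(count / (n_small * bw_small)), fill = Class),
                 binwidth = bw_small)

# Each row of layer_data() is one stacked segment; add up height * width
bars <- layer_data(p)
total_area <- sum((bars$ymax - bars$ymin) * (bars$xmax - bars$xmin))
total_area  # should be 1 up to floating-point error
```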

[Figure: multi-class histogram]

0
votes

You can calculate the frequency density yourself, as already mentioned, but you can also compute the total count n and the bin width inside ggplot. The total count n is simply the sum of the counts over all bins and groups, and for the bin width you can use the computed variable width. If you want the relative frequency instead of the frequency density, just do not divide by width.

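    library(ggplot2)
    # count and width are variables computed by stat_bin(); sum(count) pools all groups.
    # after_stat() supersedes the older stat() helper (ggplot2 >= 3.3.0).
    ggplot(iris, aes(x = Sepal.Length, y = after_stat(count / sum(count) / width), fill = Species)) +
      geom_histogram()
    #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.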

Created on 2020-04-30 by the reprex package (v0.3.0)
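For the relative-frequency variant mentioned above (no division by width), a minimal sketch could look like the following. The binwidth of 0.25 and the names `p_rel`, `segs` and `total_height` are assumptions for illustration. Here the stacked bar heights, rather than the areas, should add up to 1.

```r
library(ggplot2)

# Relative frequency: share of all observations falling in each bin and group
p_rel <- ggplot(iris, aes(x = Sepal.Length,
                          y = after_stat(count / sum(count)),
                          fill = Species)) +
  geom_histogram(binwidth = 0.25)

# Sum the heights of all stacked segments as a check
segs <- layer_data(p_rel)
total_height <- sum(segs$ymax - segs$ymin)
total_height  # should be 1 up to floating-point error
```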