ggplot2 : Extending stat_function to the geom_violin

Question

For data held in a data.frame, I would like to compare the density estimates drawn by ggplot2::geom_violin() with the densities that stat_function() would compute, and to do this for every level of a factor.

In this setting, I want to compare the empirical densities of two samples of size 100 with the true densities of normal distributions with means 10 and 20.


library(tidyverse)

test <- tibble(a = rnorm(100, mean = 10), 
               b = rnorm(100, mean = 20)) %>% 
  gather(key, value)

One way to achieve this is to overlay stat_density and stat_function in a separate plot for every factor level. With many levels, however, this would create too many plots. (Several answers already cover that approach, e.g. overlaying a histogram with the empirical density and a dnorm function.)
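For a single factor level, such an overlay might be sketched as follows. The names sample_a and p_a are introduced here purely for illustration, and the seed is only there to make the sketch reproducible.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch
sample_a <- data.frame(value = rnorm(100, mean = 10))

p_a <- ggplot(sample_a, aes(x = value)) +
  # empirical kernel density estimate of the sample
  stat_density(geom = "line", colour = "red") +
  # theoretical density, evaluated directly without simulating data
  stat_function(fun = dnorm, args = list(mean = 10), colour = "blue")

p_a
```

This has to be repeated (or faceted) for each level, which is what becomes unwieldy with many factors.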

For clarity in the next graphs, I use the geom_flat_violin of @DavidRobinson: dgrtwo/geom_flat_violin.R.

source("geom_flat_violin.R")

# without the "true" distribution

test %>% 
  ggplot(aes(x = key, y = value)) +
  geom_flat_violin(col = "red", fill = "red", alpha = 0.3) + 
  geom_point()

# comparing with the "true" distribution

test %>% 
  ggplot(aes(x = key, y = value)) +
  geom_flat_violin(col = "red", fill = "red", alpha = 0.3) + 
  geom_point() +
  geom_flat_violin(data = tibble(value = rnorm(10000, mean = 10), key = "a"),
                   fill = "blue", alpha = 0.2)

The problem with this solution is that it requires simulating, for every factor level, enough data points for the estimated density to be smooth. For the normal distribution 10000 points are enough, but other distributions might need even more.

The question is: can stat_function be used to achieve this, so that simulating data is not necessary?

  stat_function(fun = dnorm, args = list(mean = 10))
  stat_function(fun = dnorm, args = list(mean = 20))

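Allan Cameron · Accepted Answer · 2020-07-25T15:49:35

Rather than estimating the density from a large simulated sample, you could simply compute the theoretical density directly and plot it as a polygon. On a discrete axis the first level sits at x = 1, so 1 + dnorm(value, 10) traces the density outward from the position of "a":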

library(tidyverse)
source("geom_flat_violin.R")  # provides geom_flat_violin(), as in the question

test <- tibble(a = rnorm(100, mean = 10), 
               b = rnorm(100, mean = 20)) %>% 
  gather(key, value) 

test %>%
  ggplot(aes(x = key, y = value)) +
  geom_flat_violin(col = "red", fill = "red", alpha = 0.3) + 
  geom_point() +
  geom_polygon(data = tibble(value = seq(7, 13, length.out = 100), 
                             key = 1 + dnorm(value, 10)),
               fill = "blue", colour = "blue", alpha = 0.2)
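The same idea could be extended to every factor level by building one polygon per level. The following is only a sketch under a few assumptions: true_means is a hypothetical lookup of the theoretical mean for each level, listed in the same order as the levels appear on the axis, the plotting range of mean ± 3 mirrors the seq(7, 13) used above, and ggplot2's built-in geom_violin stands in for geom_flat_violin so that the example runs without the external file. The polygon width uses raw dnorm values, so it is not guaranteed to be on exactly the same scale as the violin.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch
test <- data.frame(
  key   = rep(c("a", "b"), each = 100),
  value = c(rnorm(100, mean = 10), rnorm(100, mean = 20))
)

# Hypothetical lookup: theoretical mean per factor level, in axis order
true_means <- c(a = 10, b = 20)

# One polygon outline per level: level i sits at x = i on the discrete axis,
# and the theoretical density is added to that position as a horizontal offset
true_density <- do.call(rbind, lapply(seq_along(true_means), function(i) {
  value <- seq(true_means[[i]] - 3, true_means[[i]] + 3, length.out = 100)
  data.frame(
    group = names(true_means)[i],
    value = value,
    xpos  = i + dnorm(value, mean = true_means[[i]])
  )
}))

p <- ggplot(test, aes(x = key, y = value)) +
  geom_violin(colour = "red", fill = "red", alpha = 0.3) +
  geom_point() +
  # the discrete layers come first so the x scale is discrete;
  # the numeric xpos values are then placed on that scale
  geom_polygon(data = true_density,
               aes(x = xpos, y = value, group = group),
               fill = "blue", colour = "blue", alpha = 0.2,
               inherit.aes = FALSE)

p
```

If geom_flat_violin.R has been sourced, geom_violin can be swapped back for geom_flat_violin. For other distributions, dnorm and the range passed to seq would need to be replaced accordingly.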
