6
votes

I'd like to plot data so that the y-axis shows probability (in the range [0,1]) and the x-axis shows the data values. The data is continuous (also in the range [0,1]), so I'd like to use a kernel density estimation function and normalize it such that the y-value at a point x means the probability of seeing value x in the input data.

So, I'd like to ask:

a) Is this reasonable at all? I understand that I cannot have a probability of seeing values that are not in the data, but I would just like to interpolate between the points I have using a kernel density estimation function and normalize the result afterwards.

b) Are there any built-in options in ggplot that I could use to override the default behavior of, for example, geom_density() to do this?

Thanks in advance,

Timo

EDIT: When I said "normalize" before, I actually meant "scale". But I got the answer, so thanks guys for clearing up my mind about this.

3
I'm not sure what you mean by plotting probability but "not density" when you also mention wanting to kernel smooth the data. What the kernel does is turn an empirical distribution (i.e. the histogram) into a smoothed density function (i.e. the PDF). I think you have to let go of either the kernel smoother requirement or the desire to not plot density. Although you may be addressing this when you say "normalize it afterward." - JD Long
Thank you for clearing up my mind. I think that just using a histogram is more appropriate in this case, then. If I tried to plot probability with smoothing, I guess I would confuse anybody trying to interpret the plots. - Timo
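
To make the distinction in the comments above concrete, here is a minimal base R sketch. The sample data (`rbeta(500, 2, 5)`) and the bin width of 0.05 are assumptions chosen only for illustration; they are not from the question. It shows that the height of a kernel density curve is not a probability (it can exceed 1 for data in [0,1]; only the area under the curve is 1), whereas per-bin proportions from a histogram are probabilities that sum to 1.

```r
set.seed(42)
x <- rbeta(500, 2, 5)          # hypothetical data in [0, 1]

# Kernel density estimate: the curve integrates to about 1,
# but its height is a density, not a probability.
d <- density(x)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)  # trapezoid rule
peak <- max(d$y)               # can be larger than 1

# Histogram proportions: each bar is the share of observations in that bin.
breaks <- seq(0, 1, by = 0.05) # assumed bin width
h <- hist(x, breaks = breaks, plot = FALSE)
prob <- h$counts / sum(h$counts)

barplot(prob, names.arg = h$mids, ylim = c(0, 1),
        xlab = "x", ylab = "proportion of observations")
```

The bar heights depend on the chosen bin width, which is one reason a "probability of seeing value x" only makes sense per interval for continuous data.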

3 Answers

11
votes

This isn't a ggplot answer, but if you want to bring together the ideas of kernel smoothing and histograms you could do a bootstrapping + smoothing approach. You'll get beat about the head and shoulders by stats folks for doing ugly things like this, so use at your own risk ;)

Start with some synthetic data:

set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3) )
hist(randomData, freq=FALSE)
lines(density(randomData), col="red")


The density function has a reasonably smart bandwidth calculator which you can borrow from:

bw <- density(randomData)$bw
resample <- sample( randomData, 10000, replace=TRUE)

Then use the calculated bandwidth as the SD to generate some random noise:

noise <- rnorm(10000, 0, bw)
hist(resample + noise, freq=FALSE)
lines(density(randomData), col="red")


Hey look! A kernel smoothed histogram!

I know this long response is not really an answer to your question, but maybe it will provide some creative ideas on how to abuse your data.
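
As a rough check on why this works: picking a data point at random and adding Gaussian noise with SD equal to the bandwidth amounts to drawing from the Gaussian kernel density estimate itself (sometimes called a smoothed bootstrap). The sketch below reuses the steps above and compares the histogram of the noisy resample with the density curve at the bin midpoints. The names `smoothed`, `kde_at_mids`, and `max_gap`, and the choice of about 40 bins, are mine and only illustrative.

```r
set.seed(1)
randomData <- c(rnorm(100, 5, 3), rnorm(100, 20, 3))

kde <- density(randomData)     # Gaussian kernel by default
bw <- kde$bw
n_draws <- 10000

# Resample the data and jitter each draw with N(0, bw) noise.
smoothed <- sample(randomData, n_draws, replace = TRUE) +
  rnorm(n_draws, 0, bw)

# Compare histogram heights with the KDE evaluated at the bin midpoints.
h <- hist(smoothed, breaks = 40, plot = FALSE)
kde_at_mids <- approx(kde$x, kde$y, xout = h$mids)$y  # NA outside the KDE grid
max_gap <- max(abs(h$density - kde_at_mids), na.rm = TRUE)
max_gap                        # should be small relative to the peak density
```

If the kernel is not Gaussian, the noise would have to be drawn from that kernel instead of `rnorm()`, so treat this as specific to the default settings of `density()`.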

12
votes

Just making up a quick merge of @JD Long's and @yesterday's answers:

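library(ggplot2)
# df is assumed to be a data frame with a numeric column x,
# e.g. df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x = x)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), binwidth = density(df$x)$bw) +
  geom_density(fill = "red", alpha = 0.2) +
  theme_bw() +
  xlab('') +
  ylab('')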


This way the binwidth for ggplot2 is calculated by the density function, and the density curve is drawn on top of the histogram with some transparency. But you should definitely look into stat_density, as @yesterday suggested, for further customization.

7
votes

You can control the behaviour of density / kernel estimation in ggplot by calling stat_density() rather than geom_density().

See the online user manual: http://had.co.nz/ggplot2/stat_density.html. You can specify any of the kernel estimation functions that are supported by stats::density().

library(ggplot2)
df <- data.frame(x = rnorm(1000))
ggplot(df, aes(x=x)) + stat_density(kernel="biweight")
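
Since the question's edit says the goal was to "scale" the curve, here is a hedged sketch of one way to get a y-axis in [0,1] with the same function. It assumes ggplot2 3.3.0 or later (for `after_stat()`) and uses the `scaled` variable that stat_density() computes, which is the density estimate divided by its maximum. The beta-distributed sample data is my own assumption. Note that the scaled value is still not a probability, only a density rescaled to peak at 1.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = rbeta(1000, 2, 5))   # hypothetical data in [0, 1]

p <- ggplot(df, aes(x = x)) +
  # `scaled` is the density estimate rescaled to a maximum of 1
  stat_density(aes(y = after_stat(scaled)), kernel = "biweight", geom = "line") +
  ylab("density scaled to a maximum of 1")

# Inspect the values ggplot2 computed for the layer.
built <- layer_data(p)
range(built$y)

print(p)
```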
