2 votes

When plotting a histplot with stat="density" and kde=True, the area under the KDE curve is equal to 1. From the Seaborn documentation:

"The units on the density axis are a common source of confusion. While kernel density estimation produces a probability distribution, the height of the curve at each point gives a density, not a probability. A probability can be obtained only by integrating the density across a range. The curve is normalized so that the integral over all possible values is 1, meaning that the scale of the density axis depends on the data values."

Below is an example of a density histplot with the default KDE, normalized to 1.

[Figure: density histplot with KDE, area normalized to 1]

However, you can also plot a histogram with stat set to count or probability. Plotting a KDE on top of those produces the following:

[Figures: count histplot with KDE, and probability histplot with KDE]

How is the KDE normalized in those cases? The area is certainly not equal to 1, but it has to be normalized somehow. I could not find this in the docs; the only explanation covers the KDE plotted over a density histogram. Any help appreciated, thank you!
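
For reference, a minimal sketch that reproduces the three variants (assuming seaborn >= 0.11, where histplot accepts the stat and kde parameters; the random data here is just an arbitrary illustration):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(size=200)  # arbitrary sample data

fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
sns.histplot(data, stat='density', kde=True, ax=axs[0])      # kde area = 1
sns.histplot(data, stat='probability', kde=True, ax=axs[1])  # how is the kde scaled here?
sns.histplot(data, stat='count', kde=True, ax=axs[2])        # and here?
plt.show()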


2 Answers

3 votes

Well, the kde has an area of 1. To draw a kde which matches the histogram, the kde needs to be multiplied by the area of the histogram.

For a density plot, the histogram has an area of 1, so the kde can be used as-is.

For a count plot, the sum of the histogram heights will be the length of the given data (each data item belongs to exactly one bar). The area of the histogram will be that total height multiplied by the width of the bins. (If the bins didn't all have the same width, adjusting the kde would be quite tricky.)

For a probability plot, the sum of the histogram heights will be 1 (for 100 %). The total area will be the bin width multiplied by that sum of heights, so it is equal to the bin width.

Here is some code to explain what's going on. It uses standard matplotlib bars, numpy to calculate the histogram and scipy for the kde:

import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
import numpy as np

data = [115, 127, 128, 145, 160]
bin_values, bin_edges = np.histogram(data, bins=4)  # counts and bin edges
bin_width = bin_edges[1] - bin_edges[0]  # np.histogram uses equal-width bins
total_area = bin_width * len(data)  # area of the count histogram

kde = gaussian_kde(data)  # kde is normalized so its total area is 1
x = np.linspace(bin_edges[0], bin_edges[-1], 200)

fig, axs = plt.subplots(ncols=3, figsize=(14, 3))
kws = {'align': 'edge', 'color': 'dodgerblue', 'alpha': 0.4, 'edgecolor': 'white'}

# density: bars divided by the total area so they integrate to 1; kde as-is
axs[0].bar(x=bin_edges[:-1], height=bin_values / total_area, width=bin_width, **kws)
axs[0].plot(x, kde(x), color='dodgerblue')
axs[0].set_ylabel('density')

# probability: bar heights sum to 1; kde multiplied by the bin width
axs[1].bar(x=bin_edges[:-1], height=bin_values / len(data), width=bin_width, **kws)
axs[1].plot(x, kde(x) * bin_width, color='dodgerblue')
axs[1].set_ylabel('probability')

# count: raw counts; kde multiplied by the total histogram area
axs[2].bar(x=bin_edges[:-1], height=bin_values, width=bin_width, **kws)
axs[2].plot(x, kde(x) * total_area, color='dodgerblue')
axs[2].set_ylabel('count')

plt.tight_layout()
plt.show()

[Figure: calculating seaborn histograms with kde — density, probability and count bars with matching kde curves]
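
As a quick sanity check (my addition, appended to the script above so its variables are still in scope), the bar areas of the three subplots can be computed directly from the histogram, confirming the factors used to scale the kde:

# areas of the three bar charts, computed from the histogram itself
print(np.sum(bin_values / total_area) * bin_width)   # density bars: 1.0
print(np.sum(bin_values / len(data)) * bin_width)    # probability bars: bin_width
print(np.sum(bin_values) * bin_width)                # count bars: total_area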

1 vote

As far as I understand it, the KDE (kernel density estimate) is simply a smoothing of the curve formed by the data points. What changes between the three representations is the scale at which it is drawn:

  • With density estimation, the total area under the KDE curve is 1, which means you can estimate the probability of finding a value between two bounds with an integral computation (see the sketch after this list). I think they smooth the data points with a curve, compute the area under that curve, and divide all the values by the area, so that the curve keeps the same shape but its area becomes 1.

  • With probability estimation, the total area under the KDE curve does not matter: each category has a certain probability (e.g. P(x in [115; 125]) = 0.2) and the probabilities of all categories sum to 1. So instead of computing the area under the KDE curve, they count all the samples and divide each bin's count by the total.

  • With count estimation, you get a standard bin/count distribution, and the KDE just smooths the numbers so that you can estimate the distribution of values, i.e. how your observations might look if you took more measurements or used more bins.
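
A small numeric check of the points above (my sketch, reusing gaussian_kde and the sample data from the first answer; the exact probability depends on the kde bandwidth, so it will not be exactly 0.2):

from scipy.stats import gaussian_kde
import numpy as np

data = [115, 127, 128, 145, 160]  # sample data from the first answer
kde = gaussian_kde(data)

# Riemann-sum approximation of the total area under the kde curve: ~1.0
x = np.linspace(min(data) - 60, max(data) + 60, 2000)
print(np.sum(kde(x)) * (x[1] - x[0]))

# probability of a value falling between two bounds = integral of the density
print(kde.integrate_box_1d(115, 125))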

So, all in all, the KDE curve stays the same: it is a smoothing of the sample data distribution. But a scaling factor is applied to it depending on which representation of the data you are interested in.

However, take what I am writing with a grain of salt: I think I am not far from the truth mathematically, but maybe someone could explain it in more precise terms, or correct me if I'm wrong.

Here is some reading about kernel density estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation. In short, it is a smoothing method with some special mathematical properties that depend on the parameters used.