2
votes

let's say I have this array A:

array([ 0.0019879 , -0.00172861, -0.00527226,  0.00639585, -0.00242005,
   -0.00717373,  0.00371651,  0.00164218,  0.00034572, -0.00864304,
   -0.00639585,  0.006828  ,  0.00354365,  0.00043215, -0.00440795,
    0.00544512,  0.00319793,  0.00164218,  0.00025929, -0.00155575,
    0.00129646,  0.00259291, -0.0039758 ,  0.00328436,  0.00207433,
    0.0011236 ,  0.00440795,  0.00164218, -0.00319793,  0.00233362,
    0.00025929,  0.00017286,  0.0008643 ,  0.00363008])

If I run:

np.histogram(A, bins=9, density=True)

as hist I get:

array([  34.21952021,   34.21952021,   34.21952021,   34.21952021,
     34.21952021,  188.20736116,  102.65856063,   68.43904042,
     51.32928032])

The manual says:

"If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function."

I thought I had a good understanding of histograms and density functions but I really don't understand what those values represent or how they are calculated.

I need to reproduce those values with R, as I am porting some code between the two languages.

1
One of the nice things about open source software is that if you don't know how something was calculated you can always have a look yourself.Alex
Thanks for the link, it is very interesting but I think it is out of my league for now to investigate how the function is internally built.goingdeep

1 Answers

2
votes

In R, you can use the hist() function to plot your histogram. Additionally, hist is an S3 function that produces a list.

A <- c(0.0019879 , -0.00172861, -0.00527226,  0.00639585, -0.00242005,
        -0.00717373,  0.00371651,  0.00164218,  0.00034572, -0.00864304,
        -0.00639585,  0.006828  ,  0.00354365,  0.00043215, -0.00440795,
        0.00544512,  0.00319793,  0.00164218,  0.00025929, -0.00155575,
        0.00129646,  0.00259291, -0.0039758 ,  0.00328436,  0.00207433,
        0.0011236 ,  0.00440795,  0.00164218, -0.00319793,  0.00233362,
        0.00025929,  0.00017286,  0.0008643 ,  0.00363008)

Here is the default histogram produced by R with your vector A.

hist(A)

Here is the histogram with an additional layer for the density curve.

hist(A, freq = F)
lines(density(A), col = 'red')

enter image description here

Let us store the list hist(A) to p.

p <- hist(A)

We can now see the contents of the list p.

str(p)
# List of 6
#  $ breaks  : num [1:10] -0.01 -0.008 -0.006 -0.004 -0.002 0 0.002 0.004 # 0.006 0.008
#  $ counts  : int [1:9] 1 2 2 3 2 12 8 2 2
#  $ density : num [1:9] 14.7 29.4 29.4 44.1 29.4 ...
#  $ mids    : num [1:9] -0.009 -0.007 -0.005 -0.003 -0.001 0.001 0.003 0.005 0.007
#  $ xname   : chr "A"
#  $ equidist: logi TRUE
#  - attr(*, "class")= chr "histogram"

The density refers to the theoretical density function value. This can exceed 1, but the area under the density curve should be equal to 1. The width of each bar is easily determined by the difference between the breakpoints (breaks) of the bars in the histogram. Thus, if we multiply the width of each bar of the histogram by the p$density, and add the results, we should get a sum of 1.

sum(diff(p$breaks) * p$density)
# [1] 1