I have a data set and fit a lognormal distribution to it. I first estimate the optimal parameters of the lognormal function, then plot the histogram together with the fitted function. The result looks quite good:

Histogram in blue, fitting function in red.

import scipy as sp
import scipy.stats  # explicit submodule import so that sp.stats is available
import numpy as np
import matplotlib.pyplot as plt

num_data = len(data)

x_axis = np.linspace(min(data), max(data), num_data)

number_of_bins = 240
# `normed` was removed from NumPy; `density=False` returns raw counts
histo, bin_edges = np.histogram(data, number_of_bins, density=False)

shape, location, scale = sp.stats.lognorm.fit(data)

# `normed` was removed from Matplotlib; `density=False` plots raw counts
plt.hist(data, number_of_bins, density=False)


# the scaling factor scales the normalized lognormal function up to the size
# of the histogram: 
scaling_factor = len(data)*(max(data)-min(data))/number_of_bins

plt.plot(x_axis,
         scaling_factor * sp.stats.lognorm.pdf(x_axis, shape, location, scale),
         'r-')

# adjust the axes dimensions:
plt.axis([bin_edges[0] - 10, bin_edges[-1] + 10, 0, histo.max() * 1.1])

However, when I run the Kolmogorov-Smirnov test on the data against the fitted function, I get extremely low p-values (on the order of 1e-32):

# the call must open on the same line as the assignment to be valid Python
lognormal_ks_statistic, lognormal_ks_pvalue = sp.stats.kstest(
       data,
       lambda k: sp.stats.lognorm.cdf(k, shape, location, scale),
       args=(),
       N=len(data),
       alternative='two-sided',
       mode='approx')

print(lognormal_ks_statistic)
print(lognormal_ks_pvalue)

This seems wrong, since the plot shows that the fit is quite accurate. Does anybody know where I made a mistake?

Thanks a lot!! Charles

1 Answer

This simply means that your data isn't exactly log-normal. Judging by the histogram, you have a lot of data points for the K-S test to use. With a sample that large, if your data is even slightly different from what a log-normal distribution with those parameters would produce, the K-S test will indicate that the data isn't drawn from that log-normal distribution.

Where is the data from? If it is from an organic source, or any source other than specifically drawing random numbers from a log-normal distribution, I would expect an extremely small p-value, even if the fit looks great. This certainly isn't a problem, though, as long as the fit is sufficiently good for your purposes.
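
To make the sample-size effect concrete, here is a minimal sketch on synthetic data, not on your data. It assumes gamma-distributed values as a made-up stand-in for "almost log-normal" data, fits a log-normal with the location fixed at 0 (using the mean and standard deviation of log(x)), and then runs the K-S test on the full sample and on its first 200 points. The names `sample` and `ks_against_lognormal`, the seed, and the sample sizes are all illustrative choices.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for "almost log-normal" data (an assumption for this
# sketch): gamma-distributed values are right-skewed and look roughly log-normal.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=10.0, scale=1.0, size=200_000)

# Maximum-likelihood log-normal fit with the location fixed at 0:
# shape = std of log(x), scale = exp(mean of log(x)).
log_sample = np.log(sample)
shape = log_sample.std()
scale = np.exp(log_sample.mean())


def ks_against_lognormal(values):
    """Return (D, p-value) of the K-S test against the fitted log-normal."""
    result = stats.kstest(values, 'lognorm', args=(shape, 0.0, scale))
    return result.statistic, result.pvalue


d_full, p_full = ks_against_lognormal(sample)          # all 200,000 points
d_small, p_small = ks_against_lognormal(sample[:200])  # first 200 points only

print(f"N = {sample.size}: D = {d_full:.4f}, p = {p_full:.3g}")
print(f"N = 200: D = {d_small:.4f}, p = {p_small:.3g}")
```

In a run like this, the K-S statistic D for the full sample should stay small, while its p-value should be vanishingly small because the test becomes more sensitive as the number of points grows. When deciding whether the fit is good enough for your purposes, D itself (the largest gap between the empirical and fitted CDFs) is usually the more informative number.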