Chi square goodness of fit for random numbers generated

Question

I have used the inverse CDF method to generate 1000 samples each from an exponential and a Cauchy random variable.

Now, to verify that these samples follow their respective distributions, I have to perform a chi-squared goodness-of-fit test.

I have tried two approaches (below):

```
chisq.test(y) # y holds 1000 samples from the supposed exponential distribution
chisq.test(z) # z holds the Cauchy samples
```

I am getting the following output and error:

    data:  y
    X-squared = 234.0518, df = 999, p-value = 1

    Warning message:
    In chisq.test(y) : Chi-squared approximation may be incorrect

    chisq.test(z)
    Error in chisq.test(z) :
      all entries of 'x' must be nonnegative and finite

I downloaded the vcd library to use goodfit() and typed:

    t1 <- goodfit(y,type= "exponential",method= "MinChiSq")
    summary(t1)

In this case, the error message is:

    Error: could not find function "goodfit"

Can somebody please guide me on how to implement the chi-squared GOF test properly?

Note: the samples are not from a normal distribution (they are exponential and Cauchy, respectively). I have also been trying to find out whether it is possible to get the observed and expected data instead, with no luck so far.

Edit: I did run library(vcd) before the rest of the code. Apologies for assuming that was obvious.

Usually for chi squared test you need "expected frequency" which might mean you should bin your data into bins of appropriate width (maybe 10 per bin). Then you have to compare "observed" and "expected" - in other words you need to give the probability p as another factor in the chisq function. See ww2.coastal.edu/kingw/statistics/R-tutorials/goodness.html for simple example. — Floris
Did you ever actually load the package? Just downloading using install.packages doesn't load the package. You'll need to use library(vcd) to load it. — Dason

jlhoward · Accepted Answer · 2014-02-09T21:40:06

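The chisq.test(...) function is designed primarily for use with counts, so it expects its arguments either to be countable (using table(...), for example) or to be counts already. Given two arguments x and y, it builds a contingency table from them and tests whether the two are independent. Given a single vector, it treats the values as counts and tests them against the probabilities in p (equal by default). That is why chisq.test(y) on your raw sample reports df = 999: each of the 1000 values was treated as the count for its own category.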
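To make the point about counts concrete, here is a minimal sketch of what happens when raw samples are passed straight to chisq.test(...). It assumes samples drawn with rexp and rcauchy rather than your own inverse CDF code, and the variable names are only illustrative.

```r
set.seed(1)
y <- rexp(1000)                 # raw continuous sample, not counts

# Each of the 1000 values is treated as the count for its own category
# and compared with equal expected proportions.
raw <- suppressWarnings(chisq.test(y))
raw$parameter                   # degrees of freedom: number of values minus 1

# A Cauchy sample contains negative values, which cannot be counts,
# so the call stops with an error instead of returning a result.
z <- rcauchy(1000, 100, 100)
cauchy_result <- tryCatch(chisq.test(z),
                          error = function(e) conditionMessage(e))
cauchy_result
```

Neither call tests the sample against an exponential or Cauchy distribution, which is why the data have to be binned into counts first.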

You are probably better off using the Kolmogorov–Smirnov test, which is designed for problems like yours. The K-S test compares the ecdf of the sample to the cdf of the test distribution and tests the null hypothesis that they are the same.
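The comparison of the ecdf with the cdf can be sketched by hand. This is only an illustration of the two-sided statistic for a fully specified exponential null, assuming a sample drawn with rexp; the names Fy, d_plus and d_minus are made up for the example.

```r
set.seed(1)
y <- rexp(1000)
n <- length(y)

Fy <- pexp(sort(y))             # hypothesised CDF at the ordered sample values

# The ECDF jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the gap has to be checked on both sides of every jump.
d_plus  <- max(seq_len(n) / n - Fy)
d_minus <- max(Fy - (seq_len(n) - 1) / n)
D <- max(d_plus, d_minus)

# Compare with the statistic reported by ks.test
D_builtin <- unname(ks.test(y, "pexp")$statistic)
all.equal(D, D_builtin)
```

D is the largest vertical distance between the two curves; ks.test then converts it into a p-value.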

    set.seed(1)
    df <- data.frame(y = rexp(1000),
                     z = rcauchy(1000, 100, 100))

    ks.test(df$y,"pexp")
    # One-sample Kolmogorov-Smirnov test
    #
    # data:  df$y
    # D = 0.0387, p-value = 0.1001
    # alternative hypothesis: two-sided

    ks.test(df$z,"pcauchy",100,100)
    #  One-sample Kolmogorov-Smirnov test
    #
    # data:  df$z
    # D = 0.0296, p-value = 0.3455
    # alternative hypothesis: two-sided

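Note that in this case the K-S test gives a p-value of about 0.10 for df$y: if the sample really is exponential, a D statistic at least this large would turn up in roughly 10% of samples. That is not enough to reject the null hypothesis at the usual 5% level, but it is a fairly small p-value for a sample that clearly did come from an exponential distribution. (A p-value is not the probability that the sample did or did not come from the test distribution.)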
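Since the question drew its samples by the inverse CDF method rather than with rexp and rcauchy, here is a sketch of the same K-S check applied to samples generated that way. The rate, location and scale values are assumptions chosen to match the example above, and loc and scl are illustrative names.

```r
set.seed(1)
n <- 1000

# Exponential: F(x) = 1 - exp(-rate * x), so x = -log(1 - u) / rate
rate <- 1
u <- runif(n)
y <- -log(1 - u) / rate

# Cauchy: F(x) = 1/2 + atan((x - loc) / scl) / pi,
# so x = loc + scl * tan(pi * (v - 1/2))
loc <- 100
scl <- 100
v <- runif(n)
z <- loc + scl * tan(pi * (v - 0.5))

# The parameters passed to ks.test must be the ones used for generation
ks_y <- ks.test(y, "pexp", rate)
ks_z <- ks.test(z, "pcauchy", loc, scl)
c(exponential = ks_y$p.value, cauchy = ks_z$p.value)
```

The hand-written inverses should agree with qexp(u, rate) and qcauchy(v, loc, scl), which is a quick way to check the generator itself.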

You can use chisq.test(...) by artificially binning your data and then comparing the counts in each bin to what would be expected from your test distribution (using p=...), but this is convoluted and the answer you get depends on the number of bins.

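    breaks <- seq(0, 10, by = 1)              # bin edges 0, 1, ..., 10
    O <- table(cut(df$y, breaks = breaks))    # observed counts per bin
    p <- diff(pexp(breaks))                   # exponential probability of each bin
    chisq.test(O, p = p, rescale.p = TRUE)    # rescale p so that it sums to 1
    #   Chi-squared test for given probabilities
    #
    # data:  O
    # X-squared = 7.9911, df = 9, p-value = 0.535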

In this case the chisq test gives a p-value of 0.535, so it finds no evidence against the hypothesis that your sample came from an exponential distribution.
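One way to make the binning less arbitrary is to choose bins that are equally probable under the test distribution, so every bin has the same expected count. The sketch below assumes a rate-1 exponential and k = 20 bins (both are assumptions, not recommendations) and also shows how to pull out the observed and expected counts the question asked about.

```r
set.seed(1)
y <- rexp(1000)

k <- 20                                         # number of bins (an assumption)
# Quantiles of the test distribution as cut points: 0, ..., Inf
breaks <- qexp(seq(0, 1, length.out = k + 1))
observed <- table(cut(y, breaks = breaks, include.lowest = TRUE))

fit <- chisq.test(observed, p = rep(1 / k, k))
fit$observed                                    # counts that fell in each bin
fit$expected                                    # n / k in every bin
fit$p.value
```

The degrees of freedom here are k - 1 because the rate is fixed in advance; if the rate were estimated from the same data, the reference distribution would need one degree of freedom fewer, which chisq.test does not adjust for.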

Finally, even though they are qualitative, I find Q-Q plots to be very useful. These plot quantiles of your sample against quantiles of the test distribution. If the sample is drawn from the test distribution, the Q-Q plot should fall close to the line y=x.

    par(mfrow=c(1,2))
    plot(qexp(seq(0,1,0.01)),quantile(df$y,seq(0,1,0.01)),
         main="Q-Q Plot",ylab="df$Y", xlab="Exponential",
         xlim=c(0,5),ylim=c(0,5))
    plot(qcauchy(seq(0,.99,0.01),100,100),quantile(df$z,seq(0,.99,0.01)),
         main="Q-Q Plot",ylab="df$Z",xlab="Cauchy",
         xlim=c(-1000,1000),ylim=c(-1000,1000))

[Figure: Q-Q plots of df$y against the exponential distribution (left) and df$z against the Cauchy distribution (right)]

Looking at the Q-Q plots gives me much more confidence in asserting that df$y and df$z are drawn, respectively, from the Exponential and Cauchy distributions than either the K-S or ChiSq tests, even though I can't put a number on it.
