
I'm modelling a university applicants dataset using PROC LOGISTIC in SAS (9.2). The target variable is 'Enrolled y/n', and I'm modelling against a range of 13 variables (a mixture of indicator, continuous and class variables), including: number of applications submitted, number of events attended, applicant age, etc.

I'm using 50% of the whole dataset to train the model, which gives me a sample size (for training) of just under 15,000 observations.

When I run PROC LOGISTIC, the output reports that the majority of the variables are highly significant at <.0001.

The 'Testing Global Null Hypothesis: BETA=0' statistics are also significant at <.0001, and the Association Statistics table reports that a high percentage (90%+) of pairs of predicted probabilities are concordant.

All of which seems great, until I get to the deviance statistics for goodness of fit. The Pearson, Deviance, and Hosmer-Lemeshow tests all report a Pr > ChiSq value of <.0001. If I'm interpreting this correctly (referencing Paul Allison), this significance level means the model should be rejected on the grounds of poor fit.

I've tried using STEPWISE to reduce the model, but this only resulted in rejection of a single factor, and the GOF stats were unaffected.

Suspecting multicollinearity, I've also tried modelling single effects against the dependent variable, but I'm still getting similar results: highly significant parameter estimate p-values, yet also highly significant GOF tests.
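For reference, a rough sketch of an additional collinearity screen along these lines, using VIFs from PROC REG (the predictor names other than Total_campus_visits are placeholders for my other factors, and any class variables would need to be dummy-coded first):

/* Collinearity is a property of the predictors only, so an ordinary least-squares
   regression of the outcome on the candidate predictors gives usable VIF/tolerance
   values even though the outcome is binary. */
proc reg data=Applicant_Factors_TRAIN;
    model Applicant_Enrolled = Total_campus_visits Applications_Submitted
          Events_Attended Applicant_Age   /* placeholder names */
          / vif tol;
run;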

Is there something fundamentally wrong with my model, or am I misinterpreting the GOF tests under these circumstances? Please can anyone advise what I need to investigate?

The code I've been running for just a single effect, which produces the same problematic results as the model including all the factors:

/* Applicant_Factors_TRAIN: single-factor model */
proc logistic data=Applicant_Factors_TRAIN;
    /* AGGREGATE SCALE=NONE requests the deviance/Pearson GOF tests, LACKFIT the Hosmer-Lemeshow test, RSQ the R-square statistics */
    model Applicant_Enrolled(EVENT='1') = Total_campus_visits / aggregate scale=none lackfit rsq;
run;

Output below:

[Screenshots of the PROC LOGISTIC output: parameter estimates, global null hypothesis tests, association of predicted probabilities, and goodness-of-fit tables]

Can you post the output? This isn't a code question though; it sounds like a methodology or interpretation issue, so it may be better suited to stats.stackexchange.com. - Reeza
Hi @Reeza. You're probably correct. I've added the output screenshots - thanks for your help. - Jon295087
Yes, this needs to be posted on stats.stackexchange.com. - Reeza
It's quite possible for an effect (or all effects) to be statistically significant and yet have only a minor influence on the predictive power of the model, especially if you have a lot of data. - david25272

2 Answers


I. Current state of the model

  1. Your model has a very high percentage of ties (Percent Tied = 32.8%) and only 56.7% concordance, so its discrimination is not actually good. The c value of 0.729 (72.9%) does not give the true picture of your model, because c (the AUC) = 0.567 + (0.328 / 2); the high percentage of ties inflates it.

  2. With a high percentage of ties, a lot of your data points lead to the same predicted probability, i.e., predictions at the boundary that are neither clearly 0 nor 1, so the model cannot separate them.

II. What you need to do

You need to reduce the percentage of ties and try to achieve high concordance and high accuracy, and thus acceptable GOF statistics.

III. How you can improve your model

  1. One of the things I always do when the usual approaches like stepwise or backward selection fail is to use BOOTSTRAPPING: a simple and robust approach that identifies the most relevant variables across multiple samples, which helps increase the accuracy and stability of the model (see the sketch after this list).

  2. If bootstrapping also fails, then try machine learning techniques like Random Forest or GBM. Random Forest is considered one of the most robust algorithms for classification.

  3. If individual algorithms also fail, then try stacking, which simply combines the predictive power of multiple classification algorithms. NOTE: use this only when all the above options have been exhausted.
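A minimal sketch of the bootstrapping idea in point 1, assuming the question's training set Applicant_Factors_TRAIN; the predictor names other than Total_campus_visits, the number of replicates, and the selection thresholds are illustrative:

/* 1. Draw 100 bootstrap samples (sampling with replacement). */
proc surveyselect data=Applicant_Factors_TRAIN out=boot_samples
    method=urs samprate=1 outhits reps=100 seed=12345;
run;

/* 2. Run stepwise selection within each bootstrap replicate and capture the estimates. */
ods output ParameterEstimates=boot_estimates;
proc logistic data=boot_samples;
    by Replicate;
    model Applicant_Enrolled(event='1') = Total_campus_visits Applications_Submitted
          Events_Attended Applicant_Age   /* placeholder names */
          / selection=stepwise slentry=0.05 slstay=0.05;
run;

/* 3. Variables retained in most replicates are the most stable predictors. */
proc freq data=boot_estimates;
    where Variable ne 'Intercept';
    tables Variable / nocum;
run;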

Hope this helps:-)


I've since read that goodness-of-fit tests become inherently problematic when dealing with large datasets. Consequently, based on the literature, I've chosen to disregard them in favour of a cross-validation approach that considers the p-values, the test of the global null hypothesis, the ROC AUC (c statistic), confidence intervals, and the max-rescaled R-square statistic (Johnson and Wichern 1992; Shmueli, Lin and Lucas Jr. 2013; Li 2013).
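As a rough illustration of that approach (assuming a holdout dataset named Applicant_Factors_VALID exists alongside the training set; the name is illustrative), the model can be fitted on the training half and scored on the holdout half, comparing the c statistic / AUC and max-rescaled R-square across the two:

/* Fit on the training data, then score the holdout data; FITSTAT requests fit
   statistics (including AUC) for the scored dataset, and OUTROC= outputs the
   ROC curve points. RSQ on the MODEL statement gives the max-rescaled R-square. */
proc logistic data=Applicant_Factors_TRAIN;
    model Applicant_Enrolled(event='1') = Total_campus_visits / rsq;
    score data=Applicant_Factors_VALID out=valid_scored fitstat outroc=valid_roc;
run;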