0
votes

I have a data file (1 million rows) with one outcome variable, status (yes/no), three continuous variables and 5 nominal variables (5 categories each). I want to predict the outcome, i.e. status, and would like to know which type of analysis is suitable for building the model. I have seen logit, probit and logistic regression. I am confused about where to start and how to find the variables that are most likely to be useful for the analysis.

Data file (header and sample rows):

gender,region,age,company,speciality,jobrole,diag,labs,orders,status

M,west,41,PA,FPC,Assistant,code18,27,3,yes

M,Southwest,65,CV,FPC,Worker,code18,69,11,no

M,South,27,DV,IMC,Assistant,invalid,62,13,no

M,Southwest,18,CV,IMC,Worker,code8,6,1,yes

PS: I am using R. Any help would be greatly appreciated. Thanks!

1
If you need help with model selection, you should ask over at Cross Validated, where statistical questions are on topic (it doesn't matter that you want to do this "in R"). Once you know what model to use, you should be able to search for how to do it in R. - MrFlick
Try searching for multiple regression with dummy variables. This question is better suited to Cross Validated. - Waqas
Decision tree algorithms like C5.0 can be quite powerful in binary classification tasks involving a combination of continuous and nominal variables. - RHertel

1 Answer

2
votes

Of the three, most people start their analysis with logistic regression.

Note that logistic regression and the logit model are the same thing.

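When deciding between logistic and probit, go for logistic.

The two usually give very similar fitted probabilities, and any difference in fitting speed is negligible in practice. Logistic has the edge for interpretation, because its coefficients are log-odds that exponentiate to odds ratios.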
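To see how close the two links are in practice, here is a minimal sketch on simulated data. The data frame, the coefficients used to generate the outcome, and the object names (fit_logit, fit_probit, pred_cor) are all made up for illustration; only the column names are borrowed from the question.

```r
# Simulated stand-in for the real file (all values are hypothetical)
set.seed(1)
n <- 2000
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 30),
  gender = factor(sample(c("M", "F"), n, replace = TRUE))
)

# Assumed relationship, used only to generate an outcome to fit
p <- plogis(-1 + 0.03 * df$age - 0.02 * df$labs + 0.4 * (df$gender == "M"))
df$status <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# Same predictors, two different link functions
fit_logit  <- glm(status ~ ., data = df, family = binomial(link = "logit"))
fit_probit <- glm(status ~ ., data = df, family = binomial(link = "probit"))

# Correlation between the fitted probabilities of the two models
pred_cor <- cor(fitted(fit_logit), fitted(fit_probit))
print(pred_cor)

# Logit coefficients exponentiate to odds ratios; probit coefficients do not
print(exp(coef(fit_logit)))
```

With status coded as a factor, glm() treats the first level ("no") as failure, so both models estimate the probability of "yes".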

Now, to settle on the variables: you can vary which variables go into the model. Start with all of them:

# Full model: status (a factor or 0/1 variable) against every other column in df
model1 <- glm(status ~ ., data = df, family = binomial(link = 'logit'))

Now check the model summary with summary(model1) and look at which predictor variables are significant. Then refit with fewer variables, for example without orders:
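As a rough illustration of what to look at in the summary, the sketch below fits the full model on simulated data and pulls out the coefficient table. The data and the effect sizes are invented. drop1() is used here as one possible way to get a single test per variable, which is handy for nominal predictors that are spread over several dummy coefficients.

```r
# Simulated stand-in for the real file (all values are hypothetical)
set.seed(42)
n <- 1000
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 30),
  orders = rpois(n, 5),
  region = factor(sample(c("west", "South", "Southwest"), n, replace = TRUE))
)

# Assumption for the simulation: only age and labs drive the outcome
p <- plogis(-0.5 + 0.04 * df$age - 0.05 * df$labs)
df$status <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

model1 <- glm(status ~ ., data = df, family = binomial(link = "logit"))

# Coefficient table: estimate, std. error, z value and p-value per term,
# sorted so the smallest p-values come first
coef_table <- summary(model1)$coefficients
print(coef_table[order(coef_table[, "Pr(>|z|)"]), ])

# One likelihood-ratio test per variable (a factor is tested as a whole)
lrt <- drop1(model1, test = "Chisq")
print(lrt)
```

Small p-values are only a screening aid; with a million rows almost any effect can come out significant, so the size of the coefficients matters as well.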

# Reduced model: same predictors as model1 but without orders
model2 <- glm(status ~ gender + region + age + company + speciality + jobrole + diag + labs, data = df, family = binomial(link = 'logit'))

Comparing the reduced model with the full one makes it easier to identify which variables are important.
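One possible way to compare a full and a reduced model is a likelihood-ratio test together with AIC. This is a sketch on simulated data; the data, the true effects and the choice of which variable to drop are assumptions made for the example.

```r
# Simulated stand-in for the real file (all values are hypothetical)
set.seed(7)
n <- 1000
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 30),
  orders = rpois(n, 5)
)

# Assumption for the simulation: orders has no effect on the outcome
p <- plogis(-0.5 + 0.04 * df$age - 0.05 * df$labs)
df$status <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

model1 <- glm(status ~ age + labs + orders, data = df,
              family = binomial(link = "logit"))
model2 <- glm(status ~ age + labs, data = df,
              family = binomial(link = "logit"))

# Likelihood-ratio test of the nested models: a large p-value suggests
# the dropped variable adds little
cmp <- anova(model2, model1, test = "Chisq")
print(cmp)

# Lower AIC is better; it penalises the extra parameter in model1
aic_tab <- AIC(model1, model2)
print(aic_tab)
```

The test only makes sense when one model is nested in the other and both are fitted to exactly the same rows.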

Also, ensure that you have performed data cleaning prior to this.
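A minimal cleaning sketch, using the four sample rows from the question. Treating "invalid" in diag as a missing value is an assumption, as is the decision to code every non-numeric predictor as a factor; adjust both to what the data actually mean.

```r
# The sample rows from the question, including the stray space before "Assistant"
raw <- "gender,region,age,company,speciality,jobrole,diag,labs,orders,status
M,west,41,PA,FPC, Assistant,code18,27,3,yes
M,Southwest,65,CV,FPC,Worker,code18,69,11,no
M,South,27,DV,IMC,Assistant,invalid,62,13,no
M,Southwest,18,CV,IMC,Worker,code8,6,1,yes"

# strip.white removes leading/trailing blanks from unquoted character fields
df <- read.csv(text = raw, strip.white = TRUE, stringsAsFactors = FALSE)

# Assumption: "invalid" in diag marks a missing value
df$diag[df$diag == "invalid"] <- NA

# Nominal columns become factors; the outcome gets an explicit level order
# so that glm() models the probability of "yes"
nominal <- c("gender", "region", "company", "speciality", "jobrole", "diag")
df[nominal] <- lapply(df[nominal], factor)
df$status <- factor(df$status, levels = c("no", "yes"))

str(df)
print(colSums(is.na(df)))  # missing values per column
```

For the real file, replace text = raw with the file path. Note that glm() drops rows with missing values by default, so it is worth checking how many rows are affected before fitting.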

Avoid including highly correlated variables; you can check the numeric ones using cor().
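A small sketch of that check on simulated data, where orders is deliberately generated from labs so that the pair shows up. The 0.7 cutoff is an arbitrary choice for the example, not a rule.

```r
# Simulated stand-in for the real file (all values are hypothetical)
set.seed(3)
n <- 500
labs <- rpois(n, 30)
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = labs,
  orders = round(labs / 5 + rnorm(n, sd = 0.5)),  # built to correlate with labs
  region = factor(sample(c("west", "South", "Southwest"), n, replace = TRUE))
)

# cor() only works on numeric columns, so select those first
num_cols <- sapply(df, is.numeric)
cm <- cor(df[, num_cols])
print(round(cm, 2))

# List each pair once (upper triangle) whose absolute correlation exceeds the cutoff
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
print(data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]]
))
```

If the real data contain missing values, cor() returns NA for the affected pairs unless an argument such as use = "complete.obs" is supplied.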