I've fitted a logistic regression model for mpg across various makes and models of cars. One variable, "origin", was an integer coded 1=American, 2=German, 3=Japanese. I converted it to a factor:
origin.factor <- factor(origin, labels=c("American", "German", "Japanese"))
head(origin.factor)
[1] American American American American American
[6] American
Levels: American German Japanese
First, it was suggested that I convert "origin" to a factor using as.factor and then relabel it, but I did not see how to pass labels=c("American", "German", "Japanese") to as.factor. Any ideas?
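For reference, a minimal sketch of how the as.factor route might look. as.factor() takes no labels argument, so the relabelling would have to happen afterwards through levels(). The short origin vector here is made-up sample data, not the real column, and the sketch assumes all three codes are present so that the sorted levels line up with the new names.

```r
# Made-up integer codes standing in for the real "origin" column (assumption)
origin <- c(1L, 1L, 2L, 3L, 2L, 1L)

# Step 1: as.factor() creates levels "1", "2", "3" (sorted unique values)
origin.factor <- as.factor(origin)

# Step 2: rename the levels in that same sorted order
levels(origin.factor) <- c("American", "German", "Japanese")

# One-step alternative: factor() accepts levels and labels directly
origin.factor2 <- factor(origin, levels = 1:3,
                         labels = c("American", "German", "Japanese"))

print(origin.factor)
print(table(origin.factor, origin.factor2))
```

Spelling out levels = 1:3 in the factor() call keeps the code-to-label mapping explicit, which may be safer if some code happens to be missing from the data.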
Next, the initial logistic model with all variables yielded this output (the last column is the p-value for each variable):
auto.mpg.logistic <- glm(mpg.binary ~ cylinders + displacement + horsepower + weight + acceleration + year + origin.factor, family="binomial")
summary(auto.mpg.logistic)
Call:
glm(formula = mpg.binary ~ cylinders + displacement + horsepower + weight + acceleration + year + origin.factor, family = "binomial")
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.44937  -0.08809   0.00577   0.19315   3.03363
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -19.450793   5.956353  -3.266  0.00109 **
cylinders              -0.264169   0.439645  -0.601  0.54793
displacement            0.015568   0.013658   1.140  0.25434
horsepower             -0.043081   0.024621  -1.750  0.08017 .
weight                 -0.005762   0.001376  -4.187 2.83e-05 ***
acceleration            0.012939   0.142921   0.091  0.92786
year                    0.495635   0.086155   5.753 8.78e-09 ***
origin.factorGerman     1.971277   0.785573   2.509  0.01210 *
origin.factorJapanese   1.102741   0.713768   1.545  0.12236
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Next, I removed the variables whose p-values were above the 0.05 level of significance, which gave the following output:
auto.mpg.logistic <- glm(mpg.binary ~ horsepower + weight + year + origin.factor, family="binomial")
summary(auto.mpg.logistic)
Call:
glm(formula = mpg.binary ~ horsepower + weight + year + origin.factor, family = "binomial")
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2675  -0.0943   0.0080   0.2007   3.2653
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)           -18.240055   4.912407  -3.713 0.000205 ***
horsepower             -0.042209   0.016441  -2.567 0.010251 *
weight                 -0.004607   0.000734  -6.276 3.47e-10 ***
year                    0.457663   0.075997   6.022 1.72e-09 ***
origin.factorGerman     1.335225   0.529879   2.520 0.011740 *
origin.factorJapanese   0.628677   0.580123   1.084 0.278500
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now the only term still above the 0.05 level of significance is origin.factorJapanese.
So the question is: can I somehow remove just origin.factorJapanese and leave in origin.factorGerman, since it is significant?
Or is the appropriate action to remove origin.factor entirely, which would eliminate all aspects of this categorical variable from my logistic model (this seems like my only option...)?
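To make the two options concrete, here is a rough sketch in base R of how each might be checked. It uses simulated data rather than the real auto data, so the variable names match the post but every number is made up. The first part tests origin.factor as a whole (both dummy coefficients at once) with a likelihood-ratio test. The second part keeps only a German indicator, which is an assumption about what "removing just Japanese" would mean: it merges Japanese into the reference group, so the comparison becomes German versus everything else.

```r
set.seed(1)
n <- 300

# Simulated stand-ins for the real columns (all values are assumptions)
weight <- rnorm(n, mean = 3000, sd = 500)
origin.factor <- factor(sample(1:3, n, replace = TRUE), levels = 1:3,
                        labels = c("American", "German", "Japanese"))
eta <- 6 - 0.002 * weight + c(0, 1, 0.5)[as.integer(origin.factor)]
mpg.binary <- rbinom(n, size = 1, prob = plogis(eta))

# Option 1: test the whole factor with a likelihood-ratio test
full    <- glm(mpg.binary ~ weight + origin.factor, family = "binomial")
reduced <- update(full, . ~ . - origin.factor)
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# drop1() runs the same kind of test for each term in turn
print(drop1(full, test = "Chisq"))

# Option 2: keep only a German indicator; Japanese and American cars
# then share the reference group, which changes what the coefficient means
german <- as.integer(origin.factor == "German")
collapsed <- glm(mpg.binary ~ weight + german, family = "binomial")
print(summary(collapsed)$coefficients)
```

The likelihood-ratio test asks whether origin matters at all, whereas the individual z-tests in summary() only compare each level against the American baseline, so the two can disagree.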
I'm new to R and primarily use base R functions as per our class assignments so please consider that in your answers. Thanks,
John