How to reduce a categorical variable in a logistic regression model in R

Question

I've fitted a logistic regression model for mpg across various makes and models of cars. One variable, "origin", was an integer with 1 = American, 2 = German, 3 = Japanese. I converted it to a factor:

    origin.factor <- factor(origin, labels = c("American", "German", "Japanese"))

    head(origin.factor)
    [1] American American American American American
    [6] American
    Levels: American German Japanese

First, it was suggested that I convert "origin" to a factor using as.factor and relabel it, but I could not see how to pass labels = c("American", "German", "Japanese") to as.factor. Any ideas?

Next, the initial logistic model with all variables yielded this output (the last column gives the p-value for each variable):

    auto.mpg.logistic <- glm(mpg.binary ~ cylinders + displacement + horsepower + weight + acceleration + year + origin.factor, family = "binomial")
    summary(auto.mpg.logistic)

    Call:
    glm(formula = mpg.binary ~ cylinders + displacement + horsepower + weight + acceleration + year + origin.factor, family = "binomial")

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -2.44937  -0.08809   0.00577   0.19315   3.03363

    Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
    (Intercept)           -19.450793   5.956353  -3.266  0.00109 **
    cylinders              -0.264169   0.439645  -0.601  0.54793
    displacement            0.015568   0.013658   1.140  0.25434
    horsepower             -0.043081   0.024621  -1.750  0.08017 .
    weight                 -0.005762   0.001376  -4.187 2.83e-05 ***
    acceleration            0.012939   0.142921   0.091  0.92786
    year                    0.495635   0.086155   5.753 8.78e-09 ***
    origin.factorGerman     1.971277   0.785573   2.509  0.01210 *
    origin.factorJapanese   1.102741   0.713768   1.545  0.12236

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Next, I removed the variables whose p-values exceed the 0.05 level of significance and arrived at the following output:

    auto.mpg.logistic <- glm(mpg.binary ~ horsepower + weight + year + origin.factor, family = "binomial")
    summary(auto.mpg.logistic)

    Call:
    glm(formula = mpg.binary ~ horsepower + weight + year + origin.factor, family = "binomial")

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.2675  -0.0943   0.0080   0.2007   3.2653

    Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
    (Intercept)           -18.240055   4.912407  -3.713 0.000205 ***
    horsepower             -0.042209   0.016441  -2.567 0.010251 *
    weight                 -0.004607   0.000734  -6.276 3.47e-10 ***
    year                    0.457663   0.075997   6.022 1.72e-09 ***
    origin.factorGerman     1.335225   0.529879   2.520 0.011740 *
    origin.factorJapanese   0.628677   0.580123   1.084 0.278500

    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now the only term still above the 0.05 level of significance is origin.factorJapanese.

So the question is: can I somehow remove just origin.factorJapanese and leave in origin.factorGerman, since it is significant?

Or is the appropriate action to remove origin.factor which will eliminate all aspects of this categorical variable from my logistic model (this seems like my only option...)?

I'm new to R and primarily use base R functions as per our class assignments so please consider that in your answers. Thanks,

John

James Curran · Accepted Answer · 2020-04-04T05:52:27

This is really about statistics more than it is about R. You have a model which has a bunch of continuous explanatory variables (horsepower, weight, year), and a single factor origin.factor. The model you are fitting is a parallel lines model. That is, for each level of origin.factor you are fitting a hyper-plane (but just think about it as a line if it helps) with a different intercept for each country of origin.
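To make the "different intercept for each country of origin" idea concrete, here is a minimal sketch of the design matrix R builds for a model of this shape. The origin and weight values below are made up for illustration and are not taken from the question's data.

```r
# Hypothetical toy data: names mirror the question, values are invented.
origin <- c(1, 2, 3, 1, 2, 3)
weight <- c(3500, 2200, 2100, 4100, 2600, 2300)
origin.factor <- factor(origin, levels = 1:3,
                        labels = c("American", "German", "Japanese"))

# Design matrix for weight + origin.factor:
# one shared slope column for weight, plus 0/1 indicator columns for the
# non-base levels. Each indicator shifts the intercept for that origin only.
X <- model.matrix(~ weight + origin.factor)
print(X)
```

There is no column for American: its rows have zeros in both indicator columns, so the plain intercept is the American intercept and the other two coefficients are offsets from it.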

R uses the intercept to fit the base level of your factor, and the coefficients for the remaining levels are really the differences between the base level and each of those levels. Therefore, what the regression summary table is telling you is that German cars are different from American cars, but that there is no evidence Japanese cars are. (American is the base because it is the first level of the factor. By default R sorts the levels, here the codes 1, 2, 3, and American also happens to come first alphabetically.) Note that it tells you nothing about the difference between German and Japanese cars.
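If the German versus Japanese comparison is of interest, one option in base R is to change the base level with relevel() and refit. The sketch below uses simulated data (the sample size, effect sizes, and the names fit.american, fit.german, and origin.german are all assumptions for illustration), so the numbers it prints will not match the question's output.

```r
# Simulated stand-in for the question's data; all values are invented.
set.seed(1)
n <- 300
origin.factor <- factor(sample(1:3, n, replace = TRUE), levels = 1:3,
                        labels = c("American", "German", "Japanese"))
weight <- rnorm(n, mean = 3000, sd = 500)
eta <- 6 - 0.002 * weight + c(0, 1.3, 0.6)[as.integer(origin.factor)]
mpg.binary <- rbinom(n, 1, plogis(eta))

# Default fit: American is the base level.
fit.american <- glm(mpg.binary ~ weight + origin.factor, family = "binomial")

# Same model with German as the base level, so the coefficients are now
# American-minus-German and Japanese-minus-German.
origin.german <- relevel(origin.factor, ref = "German")
fit.german <- glm(mpg.binary ~ weight + origin.german, family = "binomial")

print(summary(fit.german)$coefficients)
```

Both fits describe the same model with the same deviance. Only the parameterisation changes, so the row for origin.germanJapanese is the direct German versus Japanese comparison that the original table does not show.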

So, you have some evidence that there are differences between some levels of the factor, but not between all of them. You really don't want to try to fit the model without the Japanese level in there (well, you might, but not for the reasons you think).
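One way to act on this in base R, sketched here under assumptions, is to judge the factor as a whole with a likelihood ratio test rather than by the p-values of its individual levels. Dropping only the Japanese dummy is the same as merging Japanese into the American base level, and that merge can be written out and tested explicitly. The data are simulated and the names fit.none, fit.full, fit.collapsed, and origin.collapsed are hypothetical.

```r
# Simulated stand-in for the question's data; all values are invented.
set.seed(1)
n <- 300
origin.factor <- factor(sample(1:3, n, replace = TRUE), levels = 1:3,
                        labels = c("American", "German", "Japanese"))
weight <- rnorm(n, mean = 3000, sd = 500)
eta <- 6 - 0.002 * weight + c(0, 1.3, 0.6)[as.integer(origin.factor)]
mpg.binary <- rbinom(n, 1, plogis(eta))

# Does origin matter at all? Compare models with and without the factor.
fit.none <- glm(mpg.binary ~ weight, family = "binomial")
fit.full <- glm(mpg.binary ~ weight + origin.factor, family = "binomial")
lrt.factor <- anova(fit.none, fit.full, test = "Chisq")
print(lrt.factor)

# "Removing" the Japanese dummy means pooling Japanese with American.
origin.collapsed <- origin.factor
levels(origin.collapsed) <- list(AmericanJapanese = c("American", "Japanese"),
                                 German = "German")
fit.collapsed <- glm(mpg.binary ~ weight + origin.collapsed, family = "binomial")

# Is the three-level factor a better fit than the pooled two-level one?
lrt.collapse <- anova(fit.collapsed, fit.full, test = "Chisq")
print(lrt.collapse)
```

The first test has 2 degrees of freedom because it tests both origin coefficients at once. The second has 1 because it tests only whether Japanese needs its own intercept. Whether pooling makes sense is a subject-matter decision as much as a statistical one.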
