
I'm a little confused about how to interpret the coefficients in a multiple regression with two categorical variables. Take the mtcars dataset as an example. According to some online sources and books, the coefficient of one categorical variable is the difference in means between that level and the reference level, given that the other variable is at its reference level. In this example, according to the aggregated result below, the coefficient of factor(vs)1 should be 81.8 - 91.0 = -9.2, but it's not; it's -13.92. Those claims seem to be wrong.
Can someone clarify this? How should I interpret the coefficients in terms of 'mean difference'?

df <- mtcars   # use mtcars as the example data
fit <- lm(hp ~ factor(vs) + factor(cyl), data = df)
fit
Call:
lm(formula = hp ~ factor(vs) + factor(cyl), data = df)

Coefficients:
 (Intercept)   factor(vs)1  factor(cyl)6  factor(cyl)8  
       95.29        -13.92         34.95        113.93  

# then the mean of hp at the different levels of vs and cyl
aggregate(hp ~ vs + cyl, df, mean)

  vs cyl       hp
1  0   4  91.0000
2  1   4  81.8000
3  0   6 131.6667
4  1   6 115.2500
5  0   8 209.2143

(There is no vs = 1, cyl = 8 combination in mtcars, so that cell is absent.)
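As a quick check, computing the mean difference the books describe directly from that aggregated table gives a different number than the fitted coefficient:

# raw cell-mean difference at cyl = 4, the reference level
means <- aggregate(hp ~ vs + cyl, df, mean)
means$hp[means$vs == 1 & means$cyl == 4] - means$hp[means$vs == 0 & means$cyl == 4]
# -9.2, not the -13.92 that lm() reports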

My second question is: what if we treat those categorical variables as ordered factors? There will then be linear and quadratic terms for those factors. But how should I interpret those coefficients?

lm(hp ~ factor(vs, ordered = TRUE) + factor(cyl, ordered = TRUE), data = df)
Call:
lm(formula = hp ~ factor(vs, ordered = TRUE) + factor(cyl, ordered = TRUE), 
    data = df)

Coefficients:
                  (Intercept)   factor(vs, ordered = TRUE).L  
                       137.96                          -9.84  
factor(cyl, ordered = TRUE).L  factor(cyl, ordered = TRUE).Q  
                        80.56                          17.97  

Thank you very much in advance.

1 Answer


Regarding the first question, the 'mean' being referred to is the model's fitted mean, not the raw cell mean. If

  • cyl is at its reference level and vs is at level 1, then the fitted mean is 95.29 - 13.92 + 0, and when
  • vs and cyl are both at their reference levels, the fitted mean is 95.29 + 0 + 0,

so -13.92 is the difference between those two fitted means. Because the model is additive (there is no vs:cyl interaction) and the design is unbalanced, the fitted means need not equal the raw cell means from aggregate, which is why -13.92 differs from 81.8 - 91.0 = -9.2. See the check sketched below.
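A minimal sketch of that check (it assumes df is mtcars and refits the model from the question):

# sketch: the coefficient equals the difference of two fitted means
df <- mtcars
fit <- lm(hp ~ factor(vs) + factor(cyl), data = df)
grid <- data.frame(vs = c(1, 0), cyl = c(4, 4))  # cyl held at its reference level
preds <- predict(fit, newdata = grid)
unname(preds[1] - preds[2])   # -13.92, matches coef(fit)["factor(vs)1"]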

By 'mean' they are referring to the expected value of y, which is estimated by the predicted value. If we write the regression equation as y = terms + residuals, then the expected value of y equals the terms, i.e.

E(y) = E(terms + residuals)
     = E(terms) + E(residuals)
     = terms + 0    <- because terms is not random and residuals have mean 0
     = terms
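If you want a coefficient that reproduces the raw cell-mean difference 81.8 - 91.0 = -9.2, the model must include the vs:cyl interaction. A minimal sketch, again assuming df is mtcars:

# sketch: with a full interaction, the dummy coefficients are raw cell-mean contrasts
fit2 <- lm(hp ~ factor(vs) * factor(cyl), data = df)
coef(fit2)["(Intercept)"]   # 91.0 = mean hp for vs = 0, cyl = 4
coef(fit2)["factor(vs)1"]   # -9.2 = 81.8 - 91.0
# the factor(vs)1:factor(cyl)8 coefficient is NA because the
# vs = 1, cyl = 8 cell is empty in mtcars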

Regarding the second question: ordered factors are rarely useful in linear models, and I would ignore them here. R assigns orthogonal polynomial contrasts (contr.poly) to ordered factors by default, so the .L and .Q terms are linear and quadratic contrasts across the levels. In the book Introductory Statistics with R, Peter Dalgaard mentions that this implementation assumes the levels are equidistant, an assumption that is questionable in general. A sketch of those contrasts follows.
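To see the contrast columns that produce the .L and .Q coefficients (a minimal sketch):

# sketch: the polynomial contrasts behind the .L and .Q coefficients
contr.poly(2)   # one column (.L): c(-1, 1)/sqrt(2), used for the 2-level vs
contr.poly(3)   # two columns (.L and .Q), used for the 3-level cyl
head(model.matrix(~ factor(cyl, ordered = TRUE), data = mtcars))

Because the 2-level contrast is c(-1, 1)/sqrt(2), the .L coefficient for vs is just a rescaling of the treatment coefficient from the first model: -9.84 * sqrt(2) is about -13.92.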