4
votes

I was running a regression using categorical variables and came across this question. Here, the user wanted to add a column for each dummy. This left me quite confused because I thought that long data, with a single column holding all the categories and stored using as.factor(), was equivalent to having dummy variables.

Could someone explain the difference between the following two linear regression models?

Linear Model 1, where Month is a factor:

dt_long
          Sales Period Month
   1: 0.4898943      1    M1
   2: 0.3097716      1    M1
   3: 1.0574771      1    M1
   4: 0.5121627      1    M1
   5: 0.6650744      1    M1
  ---                       
8108: 0.5175480     24   M12
8109: 1.2867316     24   M12
8110: 0.6283875     24   M12
8111: 0.6287151     24   M12
8112: 0.4347708     24   M12

M1 <- lm(data = dt_long,
         formula = Sales ~ Period + factor(Month))

Linear Model 2 where each month is an indicator variable:

    dt_wide
          Sales Period M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
   1: 0.4898943      1  1  0  0  0  0  0  0  0  0   0   0   0
   2: 0.3097716      1  1  0  0  0  0  0  0  0  0   0   0   0
   3: 1.0574771      1  1  0  0  0  0  0  0  0  0   0   0   0
   4: 0.5121627      1  1  0  0  0  0  0  0  0  0   0   0   0
   5: 0.6650744      1  1  0  0  0  0  0  0  0  0   0   0   0
  ---                                                        
8108: 0.5175480     24  0  0  0  0  0  0  0  0  0   0   0   1
8109: 1.2867316     24  0  0  0  0  0  0  0  0  0   0   0   1
8110: 0.6283875     24  0  0  0  0  0  0  0  0  0   0   0   1
8111: 0.6287151     24  0  0  0  0  0  0  0  0  0   0   0   1
8112: 0.4347708     24  0  0  0  0  0  0  0  0  0   0   0   1

M2 <- lm(data = dt_wide,
         formula = Sales ~ Period + M1 + M2 + M3 + ... + M11 + M12)

Judging by this previously asked question, both models seem exactly the same. However, after running both models, I noticed that model M1 returns 11 dummy estimates (because month M1 is used as the reference level), while model M2 returns 12 dummies.

Is one model better than the other? Is M1 more efficient? Can I set the reference level in M1 to make both models exactly equivalent?

2 Answers

5
votes

Defining a model as in M1 is just a shortcut for including dummy variables: if you wanted to compute those regression coefficients by hand, the predictors would clearly have to be numeric.

Now, something that perhaps you didn't notice about M2 is that one of the dummies should have an NA coefficient. That is because you manually included all of them and kept the intercept. The 12 dummies sum to 1 in every row, which is exactly the intercept column, so we have perfect collinearity. Either dropping one of the dummies or adding -1 to the formula to eliminate the constant term fixes this.
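To see the collinearity directly, here is a minimal sketch on hypothetical toy data (the factor `g` and its three levels are made up for illustration, not taken from the question). It builds one indicator per category, adds the constant column that lm() includes by default, and compares the number of columns with the rank of the design matrix.

```r
# Hypothetical toy data: 3 categories, 4 observations each
g <- factor(rep(c("a", "b", "c"), each = 4))

# One 0/1 indicator column per category (no intercept)
D <- model.matrix(~ g - 1)

# Add the constant column that lm() includes by default
X <- cbind(Intercept = 1, D)

# The indicators sum to the constant column in every row ...
all(rowSums(D) == 1)

# ... so X has 4 columns but only 3 linearly independent ones
ncol(X)
qr(X)$rank
```

When the rank is smaller than the number of columns, lm() cannot estimate every coefficient and reports NA for the aliased one.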

Some examples. Let

y <- rnorm(100)
x0 <- rep(1:0, each = 50)
x1 <- rep(0:1, each = 50)
x <- factor(x1)

In this way x0 and x1 are a decomposition of x. Then

## Too much
lm(y ~ x0 + x1)

# Call:
# lm(formula = y ~ x0 + x1)

# Coefficients:
# (Intercept)           x0           x1  
#    -0.15044      0.07561           NA  

## One way to fix it
lm(y ~ x0 + x1 - 1)

# Call:
# lm(formula = y ~ x0 + x1 - 1)

# Coefficients:
#       x0        x1  
# -0.07483  -0.15044  

## Another one
lm(y ~ x1)

# Call:
# lm(formula = y ~ x1)

# Coefficients:
# (Intercept)           x1  
#    -0.07483     -0.07561  

## The same results
lm(y ~ x)

# Call:
# lm(formula = y ~ x)

# Coefficients:
# (Intercept)           x1  
#    -0.07483     -0.07561  

Ultimately, all these models contain the same amount of information, but under perfect multicollinearity the individual coefficients are not identified.
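As a rough check of the "same information" claim, the sketch below refits two of the parameterizations and compares fitted values. It also touches on the reference-level part of the question using relevel(). The seed is arbitrary and the object names (`fit_dummies`, `fit_factor`, `x_ref1`, `fit_ref1`) are only illustrative, so the coefficients will not match the printed output above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
y  <- rnorm(100)
x0 <- rep(1:0, each = 50)
x1 <- rep(0:1, each = 50)
x  <- factor(x1)

fit_dummies <- lm(y ~ x0 + x1 - 1)  # both dummies, no intercept
fit_factor  <- lm(y ~ x)            # factor, level "0" as reference

# Different coefficients, identical fitted values
all.equal(fitted(fit_dummies), fitted(fit_factor))

# Changing the reference level leaves the fit unchanged;
# only the interpretation of the coefficients moves
x_ref1   <- relevel(x, ref = "1")
fit_ref1 <- lm(y ~ x_ref1)
all.equal(fitted(fit_factor), fitted(fit_ref1))
```

So neither version is "better" in terms of fit. The choice of reference level only determines which group the intercept describes and which differences the other coefficients measure.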

1
votes
  1. Improper dummy coding.

When you change a categorical variable into dummy variables, you will have one fewer dummy variable than you had categories. That’s because the last category is already indicated by having a 0 on all other dummy variables. Including the last category just adds redundant information, resulting in multicollinearity. So always check your dummy coding if it seems you’ve got a multicollinearity problem.
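R's default (treatment) coding already applies this rule when a factor appears in a formula, which is why the factor version of the model in the question reports 11 month coefficients. A small sketch, assuming a 12-level factor named `Month` like the one in the question (the vector here is constructed just for illustration):

```r
# Hypothetical 12-level factor mirroring the Month column in the question
Month <- factor(paste0("M", 1:12), levels = paste0("M", 1:12))

# Design matrix R would build for a formula containing Month
X <- model.matrix(~ Month)

ncol(X)       # intercept plus one dummy per non-reference level
colnames(X)   # the first level, M1, gets no column of its own

# The coding table: one row per level, one column per dummy
dim(contrasts(Month))
```

The reference level is absorbed into the intercept, so no information is lost by leaving out its dummy.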