I'm hoping that this question isn't the same as previous questions but in different words. I've tried using the solutions for previous questions but they haven't worked for me, so bear with me!
So I'm having some trouble with my linear regression model output in R. I'm concerned that the model is using an incorrect referent group as part of the interaction term I've placed into the model and even though I've tried to relevel the individual terms before they're placed into the interaction term, I'm not getting the output I expected.
I have a dataset with a continuous and categorical variables. Let's say that variables A and B are continuous and variables C, D, and E are categorical (0 = No, 1 = Yes). The referent groups for the categorical variables has been set at "No" (0). Here's an example:
ID A B C D E
1 53.6 25 No Yes No
2 51.1 12 Yes No Yes
3 50.9 NA Yes Yes No
4 49.3 2 No No No
5 48.1 NA No Yes No
I've tried a couple of different ways to get at the interaction terms, so my models are set up as follows:
lm1 <- lm(A ~ C*D + E + B, data=example)
lm2 <- lm(A ~ C:D + E + B, data=example)
I expected to get an output table listing the regression coefficient, standard error, etc. for the intercept, C alone, D alone, E, B, and then C * D, broken down into 3 of the 4 possible combination groups of that interaction term, minus the combination group that comprised both referent groups ("No" for both C and D, "C_No:D_No").
EXPECTED:
Coefficient Estimate Std. Error t value Pr(>|t|)
Intercept 90.76369 0.54308 167.127 < 2e-16 ***
C_Yes -0.28639 0.62044 -0.462 0.644465
D_Yes -3.01242 1.14733 -2.626 0.008771 **
E_Yes 0.05865 0.01691 3.468 0.000544 ***
B -0.20891 0.35982 -0.581 0.561634
C_No:D_Yes -0.42116 0.47213 2.617 0.01674 *
C_Yes:D_Yes 2.01208 1.43154 1.406 0.160148
C_Yes:D_No -0.02877 0.65271 -0.345 0.672531
For the first model, I got the output for the intercept, C alone, D alone, E, B, and then one combination group of C * D only.
ACTUAL:
Coefficient Estimate Std. Error t value Pr(>|t|)
Intercept 90.76369 0.54308 167.127 < 2e-16 ***
C_Yes -0.28639 0.62044 -0.462 0.644465
D_Yes -3.01242 1.14733 -2.626 0.008771 **
E_Yes 0.05865 0.01691 3.468 0.000544 ***
B -0.20891 0.35982 -0.581 0.561634
C_No:D_Yes -0.42116 0.47213 2.617 0.01674 *
For the second model, I got the output for the intercept, E, B, and then all combination groups of C * D.
ACTUAL:
Coefficient Estimate Std. Error t value Pr(>|t|)
Intercept 90.76369 0.54308 167.127 < 2e-16 ***
E_Yes 0.05865 0.01691 3.468 0.000544 ***
B -0.20891 0.35982 -0.581 0.561634
C_No:D_Yes -0.42116 0.47213 2.617 0.01674 *
C_Yes:D_Yes NA (all not defined because of singularities)
C_Yes:D_No -0.02877 0.65271 -0.345 0.672531
So now my questions are:
1) Is there different code that will give me everything that I want in one model instead of two?
2) Is this model, as is, using C_Yes:D_Yes as the referent group instead of C_No:D_No and that's why I'm getting the error about singularities? My variables are correlated, yes, but not perfectly so I wasn't expecting multicollinearity to be an issue.
3) If the referent group is correct, why am I getting a coefficient estimate for C_No:D_No (the referent group)?
with(example, table(C, D))
return? Are these really coded as 0/1 or Yes/No? Your output doesn't match what I would expect to see fromlm()
– MrFlick