In my multiple linear regression model, Y is the dependent variable and PERCT_A, PERCT_B, PERCT_C, PERCT_D are independent variables corresponding to the percentages of different age groups. These four variables sum to 100% in each row. Is it correct to fix one of them as the base (reference) category and run a multiple linear regression? I ran the model, and the coefficients I got are logical and make sense. However, how should I interpret them?
1 Answer
If the four variables sum to 1, you can either (i) include a constant in the regression, and drop one of the 4 vars; or (ii) remove the constant, and include all 4 vars. Your fitted values should be identical in both cases. The R^2 will differ because the definition of R^2 changes depending on whether you include a constant, but, again, the fitted values (and hence residuals) will be identical in the two models.
Here's an example with 3 vars that sum to 1.0:
set.seed(1)  # for reproducibility
n <- 10^3
df <- data.frame(fraction.children = runif(n))  # use '=' here, not '<-', so the column gets the right name
df$fraction.adults <- runif(n, max=1 - df$fraction.children)
df$fraction.grandparents <- 1 - df$fraction.children - df$fraction.adults
summary(df) # All vars are in [0, 1]
isTRUE(all.equal(rowSums(df), rep(1, nrow(df)))) # True, the fraction vars sum to 1
df$y <- rnorm(n) + 2*df$fraction.children + 5*df$fraction.adults
m1 <- lm(y ~ 0 + fraction.children + fraction.adults + fraction.grandparents, data=df)
m2 <- lm(y ~ 1 + fraction.children + fraction.adults, data=df)
summary(m1)
summary(m2)
## The R^2 are different (definition changes, see ?summary.lm)
## ...but residual standard errors are the same
isTRUE(all.equal(predict(m1, newdata=df), predict(m2, newdata=df))) # True
In terms of interpretation, I'd use the model with no constant and consider differences between the coefficients. For example, if fraction.children increases by 0.10 at the expense of fraction.adults (the sum has to stay constant), the effect on the predicted y is 0.10 * (beta.children - beta.adults).
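A quick check of that last claim, regenerating the same simulated data as above (a seed is added here for reproducibility; the row index and the 0.10 shift are arbitrary choices for illustration):

```r
# Rebuild the simulated data from the example above.
set.seed(1)
n <- 10^3
df <- data.frame(fraction.children = runif(n))
df$fraction.adults <- runif(n, max = 1 - df$fraction.children)
df$fraction.grandparents <- 1 - df$fraction.children - df$fraction.adults
df$y <- rnorm(n) + 2 * df$fraction.children + 5 * df$fraction.adults

m1 <- lm(y ~ 0 + fraction.children + fraction.adults + fraction.grandparents,
         data = df)
b <- coef(m1)

# Take one row and shift 0.10 from fraction.adults to fraction.children,
# keeping the sum constant.
row.old <- df[1, ]
row.new <- row.old
row.new$fraction.children <- row.old$fraction.children + 0.10
row.new$fraction.adults   <- row.old$fraction.adults   - 0.10

# Change in the prediction vs. the coefficient contrast.
delta.pred <- predict(m1, newdata = row.new) - predict(m1, newdata = row.old)
contrast   <- 0.10 * (b["fraction.children"] - b["fraction.adults"])
isTRUE(all.equal(unname(delta.pred), unname(contrast)))  # TRUE
```

Because the model is linear, this holds exactly for any row and any shift size: all the other terms cancel, leaving only the coefficient difference times the amount shifted.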