
In my multiple linear regression model, Y is the dependent variable and PERCT_A, PERCT_B, PERCT_C, PERCT_D are independent variables corresponding to percentages of different age groups. The sum of these 4 variables in each row is 100%. Is it correct to fix one of them as the base category and run a multiple linear regression? I ran the model, and the coefficients I got are logical and make sense. However, how should I interpret the coefficients?

1 Answer


If the four variables sum to 1, you can either (i) include a constant in the regression and drop one of the 4 vars, or (ii) remove the constant and include all 4 vars. Your fitted values will be identical in both cases. The R^2 will differ because the definition of R^2 changes when the model has no constant, but, again, the fitted values (and hence residuals) will be identical in the two models.

Here's an example with 3 vars that sum to 1.0:

n <- 10^3
df <- data.frame(fraction.children = runif(n))
df$fraction.adults <- runif(n, max=1 - df$fraction.children)
df$fraction.grandparents <- 1 - df$fraction.children - df$fraction.adults

summary(df)  # All vars are in [0, 1]
isTRUE(all.equal(rowSums(df), rep(1, nrow(df))))  # True, the fraction vars sum to 1

df$y <- rnorm(n) + 2*df$fraction.children + 5*df$fraction.adults

m1 <- lm(y ~ 0 + fraction.children + fraction.adults + fraction.grandparents, data=df)
m2 <- lm(y ~ 1 + fraction.children + fraction.adults, data=df)

summary(m1)
summary(m2)

## The R^2 are different (definition changes, see ?summary.lm)
## ...but residual standard errors are the same

isTRUE(all.equal(predict(m1, newdata=df), predict(m2, newdata=df)))  # True

In terms of interpretation, I'd look at the model with no constant and compare differences between the coefficients. For example, if fraction.children increases by 0.10 at the expense of fraction.adults (the sum has to remain constant), the effect on the predicted y is 0.10 * (beta.children - beta.adults).
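Continuing the example above (this assumes m1 and df from the earlier code are still in your workspace), you can verify that interpretation numerically by shifting 0.10 from adults to children and comparing predictions:

```r
b <- coef(m1)

## Shift 0.10 of mass from fraction.adults to fraction.children;
## the three fractions still sum to 1 in every row.
df2 <- transform(df,
                 fraction.children = fraction.children + 0.10,
                 fraction.adults   = fraction.adults   - 0.10)

## Every predicted y changes by the same amount...
delta <- predict(m1, newdata=df2) - predict(m1, newdata=df)

## ...and that amount is 0.10 * (beta.children - beta.adults)
isTRUE(all.equal(unname(delta),
                 rep(0.10 * (b["fraction.children"] - b["fraction.adults"]), n),
                 check.names=FALSE))  # True
```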