I was trying out linear regression in R with categorical attributes and noticed that I don't get a coefficient for each of the factor levels I have.
Please see my code below. I have 5 factor levels for states, but I see only 4 coefficients for them.
> states = c("WA","TE","GE","LA","SF")
> population = c(0.5,0.2,0.6,0.7,0.9)
> df = data.frame(states,population)
> df
  states population
1     WA        0.5
2     TE        0.2
3     GE        0.6
4     LA        0.7
5     SF        0.9
> states=NULL
> population=NULL
> lm(formula=population~states,data=df)
Call:
lm(formula = population ~ states, data = df)
Coefficients:
(Intercept)     statesLA     statesSF     statesTE     statesWA
        0.6          0.1          0.3         -0.4         -0.1
I also tried a larger data set, built as follows, but still see the same behavior:
for(i in 1:10)
{
df = rbind(df,df)
}
EDIT: Thanks to the responses from eipi10, MrFlick and economy, I now understand that one of the levels is being used as the reference level. But when I get new test data whose state is "GE", what do I substitute into the equation y = m1x1 + m2x2 + ... + c?
I also tried flattening the data so that each factor level gets its own column, but then I get NA as the coefficient for one of the columns. If I have new test data whose state is "WA", how can I get the population value? What do I substitute as its coefficient?
> df1
  population GE MI TE WA
1          1  0  0  0  1
2          2  1  0  0  0
3          2  0  0  1  0
4          1  0  1  0  0
> lm(formula = population ~ (GE+MI+TE+WA),data=df1)
Call:
lm(formula = population ~ (GE + MI + TE + WA), data = df1)
Coefficients:
(Intercept)           GE           MI           TE           WA
          1            1            0            1           NA
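The NA above can be checked directly. The following is a minimal sketch, assuming df1 holds exactly the four rows printed in the question; the names X, fit1 and b are introduced here for illustration only. It shows that the four indicator columns add up to the intercept column, so the design matrix has five columns but only rank 4, and lm() drops one of them.

```r
# Rebuild df1 as printed in the question (assumed to be the full data)
df1 <- data.frame(
  population = c(1, 2, 2, 1),
  GE = c(0, 1, 0, 0),
  MI = c(0, 0, 0, 1),
  TE = c(0, 0, 1, 0),
  WA = c(1, 0, 0, 0)
)

# Design matrix that lm() builds: an intercept column plus the four indicators
X <- model.matrix(population ~ GE + MI + TE + WA, data = df1)

# Each row has exactly one indicator set, so the indicators sum to the intercept column
indicators_sum_to_intercept <-
  all(rowSums(X[, c("GE", "MI", "TE", "WA")]) == X[, "(Intercept)"])

# Five columns, but only four are linearly independent
n_cols <- ncol(X)
x_rank <- qr(X)$rank

fit1 <- lm(population ~ GE + MI + TE + WA, data = df1)
b <- coef(fit1)   # the WA entry is NA because that column is aliased

# The dropped column contributes nothing, so treat its coefficient as 0.
# A WA row is then (intercept = 1, GE = 0, MI = 0, TE = 0, WA = 1).
b[is.na(b)] <- 0
wa_prediction <- sum(b * c(1, 0, 0, 0, 1))
```

In this sketch the WA prediction is just the intercept: with its column dropped, WA plays the role of the reference level.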
The coefficient for states="GE" is the intercept. In a model with an intercept, one of the factor levels has to be the "reference" level. All of the other coefficients for states are relative to "GE". – eipi10
To change the reference level, use relevel: df$states = relevel(factor(df$states), ref = "LA"). – eipi10
To fit the model without an intercept: lm(formula = population ~ states-1, data = df) – MrFlick
To get predictions for new data, use predict(). Your "solution" of creating indicator variables for all states is invalid because your model is overspecified and therefore cannot be estimated. This is a basic feature of regression with categorical variables. You might want to pick up a basic statistics textbook to learn more. This is not a programming question anymore. – MrFlick
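To make the substitution concrete, here is a minimal sketch using the data frame from the question. The names fit, new_data, by_hand and fit0 are introduced here for illustration, and states is converted to a factor explicitly because recent R versions no longer do that by default in data.frame(). For the reference level "GE" every dummy variable is 0, so the prediction is the intercept alone. For "WA" it is the intercept plus the statesWA coefficient.

```r
df <- data.frame(
  states = factor(c("WA", "TE", "GE", "LA", "SF")),
  population = c(0.5, 0.2, 0.6, 0.7, 0.9)
)

# Levels sort alphabetically, so "GE" becomes the reference level
fit <- lm(population ~ states, data = df)

# New observations; reuse the training levels so the dummy coding matches
new_data <- data.frame(
  states = factor(c("GE", "WA"), levels = levels(df$states))
)

# predict() performs the substitution into the fitted equation
p <- predict(fit, newdata = new_data)

# The same substitution by hand
b <- coef(fit)
by_hand <- c(
  b[["(Intercept)"]],                    # GE: all dummies are 0
  b[["(Intercept)"]] + b[["statesWA"]]   # WA: only the WA dummy is 1
)
same_result <- isTRUE(all.equal(unname(p), by_hand))

# Dropping the intercept gives one coefficient per level instead
fit0 <- lm(population ~ states - 1, data = df)
per_level <- coef(fit0)
```

Both parameterisations describe the same fitted values. The no-intercept form reports each level's own value, while the default form reports differences from the reference level.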