I am transitioning from Stata to R. In Stata, if I label the levels of a factor (say, 0 and 1) as M and F, the underlying values stay 0 and 1. Moreover, 0/1 coding is what dummy-variable linear regression expects in most software, including Excel and SPSS.
However, I've noticed that R codes factor levels as 1, 2 instead of 0, 1. I don't understand why R does this, given that regression internally (and correctly) treats the factor as a 0/1 variable. I would appreciate any help.
Here's what I did:
Try #1:
sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(1,0),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 2 1 2 1 1
It seems that the factor's values are now stored as 1 and 2. I believe these 1s and 2s are references to the factor levels. However, I have lost the original values, i.e. the 0s and 1s.
Try #2:
sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(0,1),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 1 2 1 2 2
Ditto. My 0s and 1s are now 1s and 2s. Quite surprising. Why is this happening?
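A minimal sketch of what seems to be going on in both tries, assuming the integers printed by str() are 1-based positions into the levels vector rather than the original data values. The names raw, orig_levels, codes and recovered are hypothetical and only used for illustration:

```r
# Hypothetical names; mirrors Try #1, where the levels were given as c(1, 0).
raw <- c(0, 1, 0, 1, 1)
orig_levels <- c(1, 0)
sex <- factor(raw, levels = orig_levels, labels = c("F", "M"))

# The stored integers are positions in levels(sex), counted from 1.
codes <- as.integer(sex)

# Indexing the original levels by those positions gives the input values back.
recovered <- orig_levels[codes]
print(codes)
print(recovered)
```

If this reading is right, nothing is lost as long as the original levels vector is kept around: the 1s and 2s only say which level each observation belongs to.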
Try #3:
Now, I wanted to see whether the 1s and 2s have any bad effect on regression. Here's what my data looks like, followed by what I did:
> head(data.frame(sassign$total_,sassign$gender))
  sassign.total_ sassign.gender
1            357              M
2            138              M
3            172              F
4            272              F
5            149              F
6            113              F
myfit<-lm(sassign$total_ ~ sassign$gender)
myfit$coefficients
    (Intercept) sassign$genderM
      200.63522        23.00606
So, it turns out that the group means are correct. When running the regression, R did use 0 and 1 as the dummy values.
I did check other threads on SO, but they mostly explain how R codes factor variables without explaining why. Stata and SPSS generally require the base category to be coded as 0, so I thought I'd ask.
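One way to check this directly is to look at the design matrix that lm() builds. The sketch below assumes R's default treatment contrasts and uses only the six rows shown by head() above, so its coefficients will not match the full-data fit. The names total, gender, X and fit are hypothetical stand-ins for the sassign columns:

```r
# Only the six rows printed by head(); hypothetical stand-ins for sassign columns.
total  <- c(357, 138, 172, 272, 149, 113)
gender <- factor(c("M", "M", "F", "F", "F", "F"))

# Default treatment contrasts: the first level (F) is the baseline, coded 0.
print(contrasts(gender))

# The design matrix holds an intercept column plus a 0/1 column named genderM.
X <- model.matrix(~ gender)
print(X)

# Intercept = mean of the baseline group; genderM = difference in group means.
fit <- lm(total ~ gender)
print(coef(fit))
```

On these six rows the intercept should equal the mean for F and the genderM coefficient the difference between the M and F means, which is consistent with the dummy column being 0/1 even though the factor itself is stored as 1/2.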
I'd appreciate any thoughts.