When R performs a regression using a categorical variable, it effectively uses dummy coding. That is, one of the levels is omitted as the base or reference level, and the regression formula includes dummies for all the other levels. But which level does R pick as the reference, and how can I influence this choice?
Example data with four levels (from UCLA's IDRE):
hsb2 <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
summary(lm(write ~ factor(race), data = hsb2))
# level 1 is the reference level
# reorder the rows so that higher race codes come first, to check whether row order matters
hsb2.ordered <- hsb2[rev(order(hsb2$race)), ]
summary(lm(write ~ factor(race), data = hsb2.ordered))
# level 1 is still the reference level
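The dummy coding itself can be inspected with `model.matrix()`, which builds the design matrix that `lm()` fits. The following is only a minimal sketch on a made-up toy factor (not the hsb2 data), assuming R's default treatment contrasts; the level that gets no column of its own is the one acting as the reference.

```r
# Toy factor with four levels (made-up values, not taken from hsb2)
race <- factor(c(3, 1, 4, 2, 1, 3))

# Levels are stored in sorted order, regardless of the order of the data
levels(race)

# Design matrix under the default treatment contrasts (assumed here)
X <- model.matrix(~ race)

# One intercept column plus one dummy column per non-reference level
colnames(X)
```

In this sketch the column names are `(Intercept)`, `race2`, `race3` and `race4`, so level 1 has no dummy of its own, which matches what the two `lm()` calls above show.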