Factor Levels and Modelling in R

Question

The following code runs a very simple lm() and tries to summarise the results (factor level, coefficient) in a small data frame:

df <- data.frame(star_sign = c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces"),
                 y = c(1.1, 1.2, 1.4, 1.3, 1.8, 1.6, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3),
                 stringsAsFactors = TRUE) # the default before R 4.0.0; needed for star_sign to be a factor in R >= 4.0.0

levels(df$star_sign) #alphabetical order

# fit a simple linear model

my_lm <- lm(y ~ 1 + star_sign, data = df)
summary(my_lm) # intercept is based on first level of factor, aquarius

# I want the levels to work properly 1..12 = Aries, Taurus...Pisces so I'm going to redefine the factor levels

df$my_levels <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

df$star_sign <- factor(df$star_sign, levels = df$my_levels)

my_lm <- lm(y ~ 1 + star_sign, data = df)
summary(my_lm) # intercept is based on first level of factor which is now Aries

# but for my model fit I want the reference level to be Virgo (because reasons)

df$star_sign_2 <- relevel(df$star_sign, ref = "Virgo")

my_lm <- lm(y ~ 1 + star_sign_2, data = df)
summary(my_lm)

df_results <- data.frame(factor_level = names(my_lm$coefficients), coeff = my_lm$coefficients )

# tidy up
rownames(df_results) <- 1:12
df_results$factor_level <- as.factor(gsub("star_sign_2", "", df_results$factor_level))

# change label of "(Intercept)" to "Virgo"
df_results$factor_level <- plyr::revalue(df_results$factor_level, c("(Intercept)" = "Virgo"))

levels(df_results$factor_level) # the levels are alphabetical + Virgo at the front (not same as display order from lm)

The factor levels aren't in the right order: I want to sort df_results so that the star signs appear in the same order as they do originally (Aries, Taurus...Pisces), as captured in the df$my_levels column. I don't think I have a good understanding of manipulating factors and their labels/levels, etc. so I'm struggling to know how to do this.

Also this is quite a long-winded and clumsy bit of code. Are there more concise ways to do this sort of thing?

Thank you.

(ps mathematically the model is obviously trivial, but that's ok for these purposes - I'm just interested in how to manipulate the outputs)

Gregor Thomas · Accepted Answer · 2020-01-28T00:01:51

Here's how I would approach the model coefficient extraction, using the broom package (and dplyr):

library(broom)
library(dplyr)
broom::tidy(my_lm) %>%
  mutate(term = sub("star_sign_2", "", term),
         term = ifelse(term == "(Intercept)", "Virgo", term),
         term = factor(term, levels = unique(term)))
# A tibble: 12 x 5
   term        estimate std.error statistic p.value
   <fct>          <dbl>     <dbl>     <dbl>   <dbl>
 1 Virgo          1.6         NaN       NaN     NaN
 2 Aries         -0.500       NaN       NaN     NaN
 3 Taurus        -0.4         NaN       NaN     NaN
 4 Gemini        -0.2         NaN       NaN     NaN
 5 Cancer        -0.300       NaN       NaN     NaN
 6 Leo            0.20        NaN       NaN     NaN
 7 Libra         -0.2         NaN       NaN     NaN
 8 Scorpio       -0.3         NaN       NaN     NaN
 9 Sagittarius   -0.4         NaN       NaN     NaN
10 Capricorn     -0.500       NaN       NaN     NaN
11 Aquarius      -0.1         NaN       NaN     NaN
12 Pisces        -0.300       NaN       NaN     NaN

Setting the levels = unique(term) is a nice trick for putting the levels in the order in which they occur.
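
To see what the trick does in isolation, here is a minimal base R sketch. The `signs` vector and the two result names are made up for illustration and are not part of the question's data.

```r
# Hypothetical example vector; values repeat and are not in alphabetical order
signs <- c("Virgo", "Aries", "Taurus", "Aries", "Virgo")

# Default behaviour: factor() sorts the levels
default_levels <- levels(factor(signs))
print(default_levels)
# "Aries"  "Taurus" "Virgo"

# With levels = unique(x): levels follow the order of first appearance
appearance_levels <- levels(factor(signs, levels = unique(signs)))
print(appearance_levels)
# "Virgo"  "Aries"  "Taurus"
```

This only gives the order you want if the values already occur in that order, which is the case for the coefficient table above.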

The other advice I have is to keep a vector of the levels, in the order you want, outside the data frame, and refer to it whenever you need to establish order. For example:

astro_order = c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

# messy but effective:
astro_order_virgo1 = factor(astro_order, levels = astro_order) %>% 
  relevel("Virgo") %>%
  levels()

So then you could replace the last step above with term = factor(term, levels = astro_order_virgo1).
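
If what you want is the results table sorted Aries, Taurus ... Pisces (as asked in the question), the same idea works with the plain astro_order vector. Below is a rough base R sketch without broom, assuming the data from the question; it rebuilds the model so that it runs on its own, and it is one possible way to do it rather than the only one.

```r
# Level order kept in a standalone vector, as suggested above
astro_order <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
                 "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

# Data from the question, with the factor levels set explicitly
df <- data.frame(star_sign = factor(astro_order, levels = astro_order),
                 y = c(1.1, 1.2, 1.4, 1.3, 1.8, 1.6, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3))

# Virgo as the reference level for the fit
df$star_sign_2 <- relevel(df$star_sign, ref = "Virgo")
my_lm <- lm(y ~ 1 + star_sign_2, data = df)

# Extract coefficients and clean up the term names
coefs <- coef(my_lm)
term <- sub("star_sign_2", "", names(coefs), fixed = TRUE)
term[term == "(Intercept)"] <- "Virgo"  # the intercept is the reference level's estimate

df_results <- data.frame(factor_level = factor(term, levels = astro_order),
                         coeff = unname(coefs))

# order() on a factor sorts by level order, so rows come out Aries ... Pisces
df_results <- df_results[order(df_results$factor_level), ]
rownames(df_results) <- NULL
print(df_results)
```

Note that the Virgo row still holds the intercept, while every other row is a difference from Virgo; sorting changes the display order only.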

This approach of keeping the level order separate is good because (a) it won't change if you reorder your data frame, and (b) it works just as well if your data frame is long and you have repeat entries of your factor levels.
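
As a quick illustration of points (a) and (b), here is a small sketch with a made-up long data frame (`long_df`, three shuffled copies of each sign). The seed and the object names are arbitrary choices for the example.

```r
astro_order <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
                 "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

set.seed(1)  # arbitrary seed, only so the shuffle is reproducible
# Hypothetical long data: each sign appears three times, rows in random order
long_df <- data.frame(star_sign = sample(rep(astro_order, times = 3)))

# Levels taken from the standalone vector: independent of row order and repeats
by_vector <- factor(long_df$star_sign, levels = astro_order)

# Levels taken from the data: depends on whichever sign happens to come first
by_unique <- factor(long_df$star_sign, levels = unique(long_df$star_sign))

print(identical(levels(by_vector), astro_order))  # TRUE
print(levels(by_unique))                          # same signs, data-dependent order
```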
