How to draw a barplot split by variable levels, while controlling for other variables via multiple regression?

Question

How can I draw a barplot for means, while controlling for other variables through regression -- in a split-bars-by-vars fashion?

My general problem

I conduct a research to figure out which fruit is more likable: mango, banana, or apple. To this end, I go ahead and sample 100 people at random. I ask them to rate, on a scale of 1-5, the degree of liking each of the fruits. I also collect some demographic information about them: gender, age, education level, and whether they are colorblind or not because I think color vision might alter the results. But my problem is that after data collection, I realize that my sample might not represent the general population well. I have 80% males while in the population sex is more evenly split. Education level in my sample is pretty uniform, even though in the population it's more common to hold only highschool diploma than have a PhD. Age is not representative as well.

Therefore, just calculating means for fruit liking based on my sample is likely to be limited in terms of generalizing conclusions to the population level. One way to deal with this problem is by running a multiple regression to control for the biased demographics data.

I want to plot the results of the regression(s) in a barplot, where I split bars (side-by-side) according to color vision levels (colorblind or not).

My data

library(tidyverse)

set.seed(123)

fruit_liking_df <-
  data.frame(
    id = 1:100,
    i_love_apple = sample(c(1:5), 100, replace = TRUE),
    i_love_banana = sample(c(1:5), 100, replace = TRUE),
    i_love_mango = sample(c(1:5), 100, replace = TRUE),
    age = sample(c(20:70), 100, replace = TRUE),
    is_male = sample(c(0, 1), 100, prob = c(0.2, 0.8), replace = TRUE),
    education_level = sample(c(1:4), 100, replace = TRUE),
    is_colorblinded = sample(c(0, 1), 100, replace = TRUE)
  )

> as_tibble(fruit_liking_df)

## # A tibble: 100 x 8
##       id i_love_apple i_love_banana i_love_mango   age is_male education_level is_colorblinded
##    <int>        <int>         <int>        <int> <int>   <dbl>           <int>           <dbl>
##  1     1            3             5            2    50       1               2               0
##  2     2            3             3            1    49       1               1               0
##  3     3            2             1            5    70       1               1               1
##  4     4            2             2            5    41       1               3               1
##  5     5            3             1            1    49       1               4               0
##  6     6            5             2            1    29       0               1               0
##  7     7            4             5            5    35       1               3               0
##  8     8            1             3            5    24       0               3               0
##  9     9            2             4            2    55       1               2               0
## 10    10            3             4            2    69       1               4               0
## # ... with 90 more rows

If I just want to get the mean values for each fruit liking level

fruit_liking_df_for_barplot <-
  fruit_liking_df %>%
  pivot_longer(.,
    cols = c(i_love_apple, i_love_banana, i_love_mango),
    names_to = "fruit",
    values_to = "rating") %>%
  select(id, fruit, rating, everything())

ggplot(fruit_liking_df_for_barplot, aes(fruit, rating, fill = as_factor(is_colorblinded))) +
  stat_summary(fun = mean,
               geom = "bar",
               position = "dodge") +
  ## errorbars
  stat_summary(fun.data = mean_se,
               geom = "errorbar",
               position = "dodge") +
  ## bar labels
  stat_summary(
    aes(label = round(..y.., 2)),
    fun = mean,
    geom = "text",
    position = position_dodge(width = 1),
    vjust = 2,
    color = "white") +
  scale_fill_discrete(name = "is colorblind?",
                      labels = c("not colorblind", "colorblind")) +
  ggtitle("liking fruits, without correcting for demographics")

But what if I want to correct these means to better represent the population?

I can use multiple regression

I will correct for the average age in the population which is 45
I will correct for the correct 50-50 split for sex
I will correct for the common education level that is highschool (coded 2 in my data)
I also have a reason to believe that age affects the liking of fruits in a non-linear way, so I will account for that as well.

lm(fruit ~ I(age - 45) + I((age - 45)^2) + I(is_male - 0.5) + I(education_level - 2)

I will run the three fruits data (apple, banana, mango) through the same model, extract the intercept, and consider that as the corrected mean after controlling for the demographics data.

First, I'll run the regressions on data with colorblind people only

library(broom)

dep_vars <- c("i_love_apple",
              "i_love_banana",
              "i_love_mango")

regresults_only_colorblind <-
  lapply(dep_vars, function(dv) {
    tmplm <-
      lm(
        get(dv) ~ I(age - 45) + I((age - 45)^2) + I(is_male - 0.5) + I(education_level - 2), 
        data = filter(fruit_liking_df, is_colorblinded == 1)
      )
    
    broom::tidy(tmplm) %>%
      slice(1) %>%
      select(estimate, std.error)
  })

data_for_corrected_barplot_only_colorblind <-
  regresults_only_colorblind %>%
  bind_rows %>%
  rename(intercept = estimate) %>%
  add_column(dep_vars, .before = c("intercept", "std.error")) 

## # A tibble: 3 x 3
##   dep_vars      intercept std.error
##   <chr>             <dbl>     <dbl>
## 1 i_love_apple       3.07     0.411
## 2 i_love_banana      2.97     0.533
## 3 i_love_mango       3.30     0.423

Then plot corrected barplot for colorblind only

ggplot(data_for_corrected_barplot_only_colorblind, 
       aes(x = dep_vars, y = intercept)) +
  geom_bar(stat = "identity", width = 0.7, fill = "firebrick3") +
  geom_errorbar(aes(ymin = intercept - std.error, ymax = intercept + std.error),
                width = 0.2) +
  geom_text(aes(label=round(intercept, 2)), vjust=1.6, color="white", size=3.5) +
  ggtitle("liking fruits after correction for demogrpahics \n colorblind subset only")

Second, I'll repeat the same regression(s) process on data with color vision only

dep_vars <- c("i_love_apple",
              "i_love_banana",
              "i_love_mango")

regresults_only_colorvision <-
  lapply(dep_vars, function(dv) {
    tmplm <-
      lm(
        get(dv) ~ I(age - 45) + I((age - 45)^2) + I(is_male - 0.5) + I(education_level - 2), 
        data = filter(fruit_liking_df, is_colorblinded == 0) ## <- this is the important change here
      )
    
    broom::tidy(tmplm) %>%
      slice(1) %>%
      select(estimate, std.error)
  })


data_for_corrected_barplot_only_colorvision <-
  regresults_only_colorvision %>%
  bind_rows %>%
  rename(intercept = estimate) %>%
  add_column(dep_vars, .before = c("intercept", "std.error")) 

ggplot(data_for_corrected_barplot_only_colorvision, 
       aes(x = dep_vars, y = intercept)) +
  geom_bar(stat = "identity", width = 0.7, fill = "orchid3") +
  geom_errorbar(aes(ymin = intercept - std.error, ymax = intercept + std.error),
                width = 0.2) +
  geom_text(aes(label=round(intercept, 2)), vjust=1.6, color="white", size=3.5) +
  ggtitle("liking fruits after correction for demogrpahics \n colorvision subset only")

What I'm ultimately looking for is to combine the corrected plots

Final note

This is primarily a question about ggplot and graphics. However, as can be seen, my method is long (i.e., not concise) and repetitive. Especially relative to the simplicity of just getting barplot for uncorrected means, as demonstrated in the beginning. I will be very happy if someone has also ideas on how to make the code shorter and simpler.

BrianLang BrianLang · Accepted Answer · 2020-08-20T09:37:41

I'm not convinced that you're getting out the statistical quantities that you want when fit the model on the data subsets. A better way to ask the questions you want to ask would be with a more complete model (include blindness in the model) and then compute model contrasts for differences in the mean score between each group.

That being said, here is some code that does what you want.

First we pivot_longer the fruit columns so that your data is in long format.
Then we group_by the fruit type and the blindness variables and call nest which gives us separate datasets for each fruit type and blindness categories.
We then use purrr::map to fit a model to each of those datasets.
broom::tidy and broom::confint_tidy give us the statistics we want for the models.
Then we need to unnest the model summaries and filter down specifically to the rows which correspond to the intercept.
We now have the data we need to create the figure, I'll leave the rest to you.

library(tidyverse)

set.seed(123)

fruit_liking_df <-
  data.frame(
    id = 1:100,
    i_love_apple = sample(c(1:5), 100, replace = TRUE),
    i_love_banana = sample(c(1:5), 100, replace = TRUE),
    i_love_mango = sample(c(1:5), 100, replace = TRUE),
    age = sample(c(20:70), 100, replace = TRUE),
    is_male = sample(c(0, 1), 100, prob = c(0.2, 0.8), replace = TRUE),
    education_level = sample(c(1:4), 100, replace = TRUE),
    is_colorblinded = sample(c(0, 1), 100, replace = TRUE)
  )

model_fits <- fruit_liking_df %>%
  pivot_longer(starts_with("i_love"), values_to = "fruit") %>% 
  group_by(name, is_colorblinded) %>%
  nest() %>% 
  mutate(model_fit = map(data, ~ lm(data = .x, fruit ~ I(age - 45) +
                                      I((age - 45)^2) +
                                      I(is_male - 0.5) + 
                                      I(education_level - 2))),
         model_summary = map(model_fit, ~ bind_cols(broom::tidy(.x), broom::confint_tidy(.x)))) 

model_fits %>%
  unnest(model_summary) %>%
  filter(term == "(Intercept)") %>% 
  ggplot(aes(x = name, y = estimate, group = is_colorblinded,
             fill = as_factor(is_colorblinded), colour = as_factor(is_colorblinded))) +
  geom_bar(stat = "identity", position = position_dodge(width = .95)) +
  geom_errorbar(stat = "identity", aes(ymin = conf.low, ymax = conf.high),
                colour = "black", width = .15, position = position_dodge(width = .95))

EDIT

In the case where you'd rather fit a single model (thus increasing sample size and reducing se of your estimates). You could pull is_colorblind into the model as a factor.

lm(data = .x, fruit ~ I(age - 45) +
 I((age - 45)^2) + I(is_male - 0.5) + 
 I(education_level - 2) + 
 as.factor(is_colorblind))

You would then want to get predictions for two observations, the "average person who is colorblind" and the "average person who is not colorblind":

new_data <- expand_grid(age = 45, is_male = .5, 
                        education_level = 2.5, is_colorblinded = c(0,1))

You could then do as before, fitting the new model with some functional programming, but group_by(name) instead of name and is_colorblind.

model_fits_ungrouped <- fruit_liking_df %>%
  pivot_longer(starts_with("i_love"), values_to = "fruit") %>% 
  group_by(name) %>%
  tidyr::nest() %>% 
  mutate(model_fit = map(data, ~ lm(data = .x, fruit ~ I(age - 45) +
                                      I((age - 45)^2) +
                                      I(is_male - .5) + 
                                      I(education_level - 2) +
                                      as.factor(is_colorblinded))),
         predicted_values = map(model_fit, ~ bind_cols(new_data, 
                                                       as.data.frame(predict(newdata = new_data, .x, 
                                                                             type = "response", se.fit = T))) %>%
                                  rowwise() %>%
                                  mutate(estimate =  fit, 
                                         conf.low =  fit - qt(.975, df) * se.fit, 
                                         conf.high = fit + qt(.975, df) * se.fit)))

With this you would make a minor change to the old plotting code:

model_fits_ungrouped %>%
  unnest(predicted_values) %>%
  ggplot(aes(x = name, y = estimate, group = is_colorblinded,
             fill = as_factor(is_colorblinded), colour = as_factor(is_colorblinded))) +
geom_bar(stat = "identity", position = position_dodge(width = .95)) +
 geom_errorbar(stat = "identity", aes(ymin = conf.low, ymax = conf.high),
                colour = "black", width = .15, position = position_dodge(width = .95))

When you compare the two plots, grouped and subgrouped, you'll notice that the confidence intervals shrink and the estimates for the means mostly get closer to 3. This would be seen as a sign that we are doing a bit better than the subgrouped model, since we know the ground truth with regards to the sampled distributions.