How can I draw a barplot for means, while controlling for other variables through regression -- in a split-bars-by-vars fashion?
My general problem
I conduct a research to figure out which fruit is more likable: mango, banana, or apple. To this end, I go ahead and sample 100 people at random. I ask them to rate, on a scale of 1-5, the degree of liking each of the fruits. I also collect some demographic information about them: gender, age, education level, and whether they are colorblind or not because I think color vision might alter the results. But my problem is that after data collection, I realize that my sample might not represent the general population well. I have 80% males while in the population sex is more evenly split. Education level in my sample is pretty uniform, even though in the population it's more common to hold only highschool diploma than have a PhD. Age is not representative as well.
Therefore, just calculating means for fruit liking based on my sample is likely to be limited in terms of generalizing conclusions to the population level. One way to deal with this problem is by running a multiple regression to control for the biased demographics data.
I want to plot the results of the regression(s) in a barplot, where I split bars (side-by-side) according to color vision levels (colorblind or not).
My data
library(tidyverse)
set.seed(123)
fruit_liking_df <-
data.frame(
id = 1:100,
i_love_apple = sample(c(1:5), 100, replace = TRUE),
i_love_banana = sample(c(1:5), 100, replace = TRUE),
i_love_mango = sample(c(1:5), 100, replace = TRUE),
age = sample(c(20:70), 100, replace = TRUE),
is_male = sample(c(0, 1), 100, prob = c(0.2, 0.8), replace = TRUE),
education_level = sample(c(1:4), 100, replace = TRUE),
is_colorblinded = sample(c(0, 1), 100, replace = TRUE)
)
> as_tibble(fruit_liking_df)
## # A tibble: 100 x 8
## id i_love_apple i_love_banana i_love_mango age is_male education_level is_colorblinded
## <int> <int> <int> <int> <int> <dbl> <int> <dbl>
## 1 1 3 5 2 50 1 2 0
## 2 2 3 3 1 49 1 1 0
## 3 3 2 1 5 70 1 1 1
## 4 4 2 2 5 41 1 3 1
## 5 5 3 1 1 49 1 4 0
## 6 6 5 2 1 29 0 1 0
## 7 7 4 5 5 35 1 3 0
## 8 8 1 3 5 24 0 3 0
## 9 9 2 4 2 55 1 2 0
## 10 10 3 4 2 69 1 4 0
## # ... with 90 more rows
If I just want to get the mean values for each fruit liking level
fruit_liking_df_for_barplot <-
fruit_liking_df %>%
pivot_longer(.,
cols = c(i_love_apple, i_love_banana, i_love_mango),
names_to = "fruit",
values_to = "rating") %>%
select(id, fruit, rating, everything())
ggplot(fruit_liking_df_for_barplot, aes(fruit, rating, fill = as_factor(is_colorblinded))) +
stat_summary(fun = mean,
geom = "bar",
position = "dodge") +
## errorbars
stat_summary(fun.data = mean_se,
geom = "errorbar",
position = "dodge") +
## bar labels
stat_summary(
aes(label = round(..y.., 2)),
fun = mean,
geom = "text",
position = position_dodge(width = 1),
vjust = 2,
color = "white") +
scale_fill_discrete(name = "is colorblind?",
labels = c("not colorblind", "colorblind")) +
ggtitle("liking fruits, without correcting for demographics")
But what if I want to correct these means to better represent the population?
I can use multiple regression
I will correct for the average age in the population which is 45
I will correct for the correct 50-50 split for sex
I will correct for the common education level that is highschool (coded
2
in my data)I also have a reason to believe that age affects the liking of fruits in a non-linear way, so I will account for that as well.
lm(fruit ~ I(age - 45) + I((age - 45)^2) + I(is_male - 0.5) + I(education_level - 2)
I will run the three fruits data (apple, banana, mango) through the same model, extract the intercept, and consider that as the corrected mean after controlling for the demographics data.
First, I'll run the regressions on data with colorblind people only
library(broom)
dep_vars <- c("i_love_apple",
"i_love_banana",
"i_love_mango")
regresults_only_colorblind <-
lapply(dep_vars, function(dv) {
tmplm <-
lm(
get(dv) ~ I(age - 45) + I((age - 45)^2) + I(is_male - 0.5) + I(education_level - 2),
data = filter(fruit_liking_df, is_colorblinded == 1)
)
broom::tidy(tmplm) %>%
slice(1) %>%
select(estimate, std.error)
})
data_for_corrected_barplot_only_colorblind <-
regresults_only_colorblind %>%
bind_rows %>%
rename(intercept = estimate) %>%
add_column(dep_vars, .before = c("intercept", "std.error"))
## # A tibble: 3 x 3
## dep_vars intercept std.error
## <chr> <dbl> <dbl>
## 1 i_love_apple 3.07 0.411
## 2 i_love_banana 2.97 0.533
## 3 i_love_mango 3.30 0.423
Then plot corrected barplot for colorblind only
ggplot(data_for_corrected_barplot_only_colorblind,
aes(x = dep_vars, y = intercept)) +
geom_bar(stat = "identity", width = 0.7, fill = "firebrick3") +
geom_errorbar(aes(ymin = intercept - std.error, ymax = intercept + std.error),
width = 0.2) +
geom_text(aes(label=round(intercept, 2)), vjust=1.6, color="white", size=3.5) +
ggtitle("liking fruits after correction for demogrpahics \n colorblind subset only")
Second, I'll repeat the same regression(s) process on data with color vision only
dep_vars <- c("i_love_apple",
"i_love_banana",
"i_love_mango")
regresults_only_colorvision <-
lapply(dep_vars, function(dv) {
tmplm <-
lm(
get(dv) ~ I(age - 45) + I((age - 45)^2) + I(is_male - 0.5) + I(education_level - 2),
data = filter(fruit_liking_df, is_colorblinded == 0) ## <- this is the important change here
)
broom::tidy(tmplm) %>%
slice(1) %>%
select(estimate, std.error)
})
data_for_corrected_barplot_only_colorvision <-
regresults_only_colorvision %>%
bind_rows %>%
rename(intercept = estimate) %>%
add_column(dep_vars, .before = c("intercept", "std.error"))
ggplot(data_for_corrected_barplot_only_colorvision,
aes(x = dep_vars, y = intercept)) +
geom_bar(stat = "identity", width = 0.7, fill = "orchid3") +
geom_errorbar(aes(ymin = intercept - std.error, ymax = intercept + std.error),
width = 0.2) +
geom_text(aes(label=round(intercept, 2)), vjust=1.6, color="white", size=3.5) +
ggtitle("liking fruits after correction for demogrpahics \n colorvision subset only")
What I'm ultimately looking for is to combine the corrected plots
Final note
This is primarily a question about ggplot
and graphics. However, as can be seen, my method is long (i.e., not concise) and repetitive. Especially relative to the simplicity of just getting barplot for uncorrected means, as demonstrated in the beginning. I will be very happy if someone has also ideas on how to make the code shorter and simpler.