0
votes

I want to run a multinomial regression to get the average frequency of each choice of a close question, split by a factor (gender: male/female).

Background

I want to compare 4 types of cheeses to measure each one's popularity, out of 4 possibilities: cheddar, mozzarella, gouda, and brie. I go out and ask 200 people for their preferred cheese. Each person selects only one choice out of the 4 types. I end up collecting some demographics information as well, including gender, age, and weight.

Upon finishing data collection, I want to see the proportion of popularity of each cheese type (together they sum to 100%). Since I want to control for gender, age, and weight, I think that a multinomial regression is suitable here.

But I'm also very interested in seeing how results differ between males and females, and I want to include gender in my model as a factor. How can I generate a twofold prediction based on my (multinomial) model, that would get predicted values for females and males separately so I could compare between the two gender levels?

Data

library(truncnorm)
library(tidyverse)

set.seed(999)

cheese_df <-
  tibble(
    age = round(rtruncnorm(
      n = 200,
      a = 20,
      b = 80,
      mean = 25,
      sd = 25.09
    )),
    cheese_response = as_factor(sample(
      c("cheddar", "mozzarella", "gouda", "brie"),
      size = 200,
      replace = TRUE
    )),
    gender = sample(c(0, 1), size = 200, replace = TRUE),
    weight = rtruncnorm(
      n = 200,
      a = 40,
      b = 120,
      mean = 70,
      sd = 25.09
    )
  )


> cheese_df

## # A tibble: 200 x 4
##      age cheese_response gender weight
##    <dbl> <fct>            <dbl>  <dbl>
##  1    45 cheddar              0   62.2
##  2    32 cheddar              0   45.0
##  3    58 cheddar              1   87.6
##  4    28 brie                 0   68.8
##  5    49 gouda                0   88.2
##  6    29 brie                 1   74.5
##  7    49 cheddar              0   74.0
##  8    27 gouda                1   90.3
##  9    28 brie                 0   56.5
## 10    48 mozzarella           0   72.9
## # ... with 190 more rows

If I just want to run a multinomial regression and control for age, gender, and weight **without splitting by gender** I could do:

library(nnet)
library(effects)


fit <- nnet::multinom(cheese_response ~ age + gender + weight, data = cheese_df)

average_person_for_control <-
  c(
    age = 50,
    gender = 0.5,
    weight = 75
  )

prediction <-
  effects::Effect("age",
                  fit,
                  given.values = average_person_for_control,
                  xlevels = list(age =
                                   c(45, 90)))


proportions_for_plot <-
  data.frame(prediction$prob, prediction$lower.prob, prediction$upper.prob) %>% 
  slice(1) %>%
  pivot_longer(., cols = everything(), 
               names_to = c(".value", "response"), 
               names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob",
         "upper_ci" = "U.prob",
         "estimate" = "prob")


ggplot(proportions_for_plot, aes(x = reorder(response, -estimate), y = estimate)) +
  geom_bar(stat = "identity", width = 0.7, fill = "darkgreen") +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),
                width = 0.2) +
  geom_text(aes(label = paste0(100*round(estimate,2), "%")),
            vjust = 1.6, 
            color = "white", size = 3) +
  xlab("cheese type") +
  ylab("proportion of people choosing this type")

controlled_for_no_split

However, I'm interested in generating the same bar plot only that it would split bars male vs female


This is the kind of plot I'm trying to get

(ignore the values in this demo) enter image description here

One approach would be to subset the data by gender, run the same model on each subset, generate two barplots and unite them. However, I want to incorporate gender within the model as a factor, and only then output the split bar plot. This is partially taken care of, as gender is already part of the model: fit <- nnet::multinom(cheese_response ~ age + gender + weight, data = cheese_df).

Still, as far as splitting the predictions by gender, in order to compare them side-by-side in a bar plot, I run into trouble. This is because effects::Effect() accepts only a vector into its given.values argument. Otherwise, I would have done something like the following to feed the prediction (like I would do if I were to use predict):

control_by_gender <-
  expand.grid(
    age = 50,
    weight = 75,
    gender = c(0, 1)
  )

> control_by_gender

##   age weight gender
## 1  50     75      0
## 2  50     75      1

Any idea how I could get such a multiple -- rather than focal -- prediction when working on a multinomial model object as seen above? My ultimate goal is the split-by-gender bar plot, like the demo above. I've been using Effects::effect for generating the prediction but am open to any substitute that could do the multiple prediction trick.

2
This seems to be mainly about code, so should be imgrated to SO.kjetil b halvorsen
Why slice(1)? It seems like you are just getting rid of one age category.andrew_reece
@andrew_reece, slice(1) is because xlevels takes 2 values for age: 45 and 90. However, it's a workaround since I'm not interested in 90 at all. But try including only a single value for xlevels (e.g. 45) and see the messy output it gives. So I ended up including 2 values for xlevels only to get rid of one later (by slice). If you know a more elegant method I'll be grateful. The only reason I use effects::Effect is to get a focal prediction given single values for age, weight, and gender. On top of that comes the current post of generating 2 predictions by gender.Emman

2 Answers

1
votes

Why not just lapply the levels into the effects::Effect call


prediction <- do.call(rbind,lapply(0:1, function(x) {
    eff <- effects::Effect("age",
                  fit,
                  given.values =c(age = 50,
                        weight = 75,
                        gender = x),
                  xlevels = list(age =c(45, 90)))
    data.frame(level=x, eff$prob, eff$lower.prob, eff$upper.prob) %>% slice(1)
    }))


proportions_for_plot <-
  prediction %>% 
  pivot_longer(., cols = !level, 
               names_to = c(".value", "response"), 
               names_pattern = "(.*)\\.(.*$)") %>%
  rename("lower_ci" = "L.prob",
         "upper_ci" = "U.prob",
         "estimate" = "prob")


ggplot(proportions_for_plot, aes(x = as.factor(response), y = estimate, fill=factor(level))) +
  geom_bar(stat = "identity", width = 0.7,position="dodge") +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), position=position_dodge(.9),
                width = 0.2) +
  geom_text(aes(label = paste0(100*round(estimate,2), "%")),
            vjust = 1.6, 
            color = "white", size = 3, position=position_dodge(.9)) +
  xlab("cheese type") +
  ylab("proportion of people choosing this type")

1
votes

This answer uses the same intuition as @Abdessabour Mtk, just with purrr::map instead and a bit of refactoring:

make_eff_df <- function(gender, fit) {
  Effect("age", fit, xlevels = list(age = c(45, 90)),
         given.values = c(age = 50, weight = 75, gender = gender)) %>%
    as_tibble() %>%
    mutate(gender = gender) %>%
    select(gender, matches("[a-z\\.]?prob")) %>%
    slice(1)
}


map_dfr(0:1, make_eff_df, fit) %>% 
  pivot_longer(-gender, names_to = c(".value", "response"), 
               names_pattern = "(.+)\\.(.+$)") %>%
  rename(lower_ci = "L.prob", upper_ci = "U.prob", estimate = "prob") %>%
  mutate(across(1:2, as.factor)) %>%
  ggplot(aes(x = reorder(response, -estimate), y = estimate, fill = gender)) +
  geom_bar(stat = "identity", width = 0.7, position = position_dodge(.9)) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci), 
                position = position_dodge(.9),
                width = 0.2) +
  geom_text(aes(label = scales::percent(estimate, accuracy = 1)),
            vjust = 1.6, color = "white", size = 3, position=position_dodge(.9)) +
  labs(x = "cheese type",
       y = "proportion of people choosing this type")

plot