0
votes

The dataset

library(dplyr)    # provides %>%, count(), filter(), mutate()
library(ggplot2)

gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
df <- data.frame(gender, answer)

is biased towards females (12 females versus 8 males):

df %>% ggplot(aes(gender, fill = gender)) + geom_bar()

My task is to build a graph that makes it easy to figure out which of the two genders is more likely to say 'Yes'.

But, given the bias, I cannot just do

df %>% ggplot(aes(x = answer, fill = gender)) + geom_bar(position = 'dodge')

or even

df %>% ggplot(aes(x = answer, y = after_stat(count / sum(count)), fill = gender)) +
  geom_bar(position = 'dodge')

To alleviate the bias, I need to divide each count by the total number of males or females respectively, so that the 'Female' bars add up to 1 and the 'Male' bars do too. Like so:

df.total <- df %>% count(gender)
male.total <- (df.total %>% filter(gender == 'Male'))$n
female.total <- (df.total %>% filter(gender == 'Female'))$n

df %>% count(answer, gender) %>%
  mutate(freq = n / if_else(gender == 'Male', male.total, female.total)) %>%
  ggplot(aes(x = answer, y = freq, fill = gender)) +
  geom_bar(stat = "identity", position = 'dodge')

This paints a completely different picture.

Questions:

  1. Is there a way to simplify the last piece of code above using only dplyr and ggplot2?
  2. Are there any other libraries that can do the trick better?
  3. Does the above type of chart have a conventional name?

Thanks.

3 Answers

3
votes

Question 1:

df %>%  
  count(gender, answer) %>% 
  group_by(gender) %>% 
  mutate(freq = n/sum(n)) %>% 
  ggplot(aes(x = answer, y = freq, fill = gender)) + 
  geom_bar(stat="identity", position = 'dodge')
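As a quick sanity check on the grouped mutate, here is a minimal sketch that stops before the plot and confirms the frequencies sum to 1 within each gender. The data frame is a compact reconstruction with the same counts as the question's data (not the original row order), and the names rel and check are made up for this illustration.

```r
library(dplyr)

# Compact reconstruction of the question's data: same counts per group
# (Male: 4 Yes, 4 No; Female: 5 Yes, 7 No), but not the original row order.
df <- data.frame(
  gender = rep(c("Male", "Female"), times = c(8, 12)),
  answer = c(rep(c("Yes", "No"), times = c(4, 4)),
             rep(c("Yes", "No"), times = c(5, 7)))
)

# Same pipeline as above, minus the plotting step.
rel <- df %>%
  count(gender, answer) %>%
  group_by(gender) %>%            # sum(n) below is now computed per gender
  mutate(freq = n / sum(n)) %>%
  ungroup()

print(rel)   # Female: No 7/12, Yes 5/12; Male: No 4/8, Yes 4/8

# Each gender's frequencies should add up to 1.
check <- rel %>%
  group_by(gender) %>%
  summarise(total = sum(freq))

print(check)
```

Because group_by(gender) comes before mutate(), sum(n) is the per-gender total, which is what replaces the manual male.total / female.total lookup from the question.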

Question 2:

You can probably do it in fewer lines with other packages.
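For comparison, and strictly as a sketch rather than a recommendation, the same within-gender proportions can be computed with no extra packages at all, using base R's table(), prop.table() and barplot(). The data frame below is a compact reconstruction with the same counts as the question's data; tab and props are names chosen for this example.

```r
# Compact reconstruction of the question's data (same counts, different row order).
df <- data.frame(
  gender = rep(c("Male", "Female"), times = c(8, 12)),
  answer = c(rep(c("Yes", "No"), times = c(4, 4)),
             rep(c("Yes", "No"), times = c(5, 7)))
)

# Cross-tabulate: rows are genders, columns are answers.
tab <- table(df$gender, df$answer)

# margin = 1 divides every row by its own total, so each gender sums to 1.
props <- prop.table(tab, margin = 1)
print(props)

# Columns (answers) become the groups on the x axis; rows (genders) are the
# side-by-side bars within each group.
barplot(props, beside = TRUE, legend.text = TRUE,
        xlab = "answer", ylab = "relative frequency within gender")
```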

Question 3:

Relative frequency bar graph.

2
votes

Given the data, the most direct way to determine whether men or women are more likely to answer "Yes" is to recode the answer as a binary variable and test the difference in proportions. The code below does this with a Welch t-test on the 0/1 indicator.

gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
isYes <- ifelse(answer=="Yes",1,0)

t.test(isYes ~ gender)

...and the output:

> t.test(isYes ~ gender)

    Welch Two Sample t-test

data:  isYes by gender
t = -0.34659, df = 14.749, p-value = 0.7338
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5965761  0.4299094
sample estimates:
mean in group Female   mean in group Male 
           0.4166667            0.5000000 

The t.test() output reports the same proportions of "Yes" as the relative frequency chart (about 0.42 for females and 0.50 for males), but the p-value of 0.73 means we fail to reject the null hypothesis that men and women are equally likely to answer "Yes" to the question asked.

Another way to interpret the t.test() output is that since 0 is within the 95% confidence interval of the difference of means, we fail to reject the null hypothesis that the means of the two groups are equal.
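Since the outcome is binary, the same comparison can also be run on the counts directly. The following is a sketch, not part of the original analysis, using prop.test() and fisher.test() from base R's stats package. The vectors are a compact reconstruction with the same counts as the data above, and tab, res_prop and res_fisher are names chosen for this example.

```r
# Compact reconstruction of the data (same counts, different row order).
gender <- rep(c("Male", "Female"), times = c(8, 12))
answer <- c(rep(c("Yes", "No"), times = c(4, 4)),
            rep(c("Yes", "No"), times = c(5, 7)))

# 2x2 table with "Yes" as the first column, because prop.test() treats the
# first column of a two-column matrix as the successes.
tab <- table(gender, answer)[, c("Yes", "No")]
print(tab)

# Two-sample test of equal proportions. With counts this small, R may warn
# that the chi-squared approximation is unreliable.
res_prop <- prop.test(tab)
print(res_prop$estimate)   # proportion of "Yes" for Female, then Male

# Fisher's exact test does not rely on that approximation.
res_fisher <- fisher.test(tab)
print(res_fisher$p.value)
```

With only 20 observations, the exact test is arguably the safer of the two. Either way, the estimated proportions are the same 5/12 and 4/8 that t.test() reports as group means.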

2
votes

position = "fill" in geom_bar is useful for seeing relative proportions, because it stacks the bars and rescales each one to a total height of 1:

library(ggplot2)

df <- data.frame(gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female"), 
                 answer = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"),
                 stringsAsFactors = FALSE)

ggplot(df, aes(gender, fill = answer)) + geom_bar(position = 'fill')

proportion plot
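As an optional refinement, here is a minimal sketch of the same plot with the y axis labelled as percentages. It assumes the scales package is available (it is installed alongside ggplot2). The data frame is a compact reconstruction with the same counts as above, and the axis title is my own wording.

```r
library(ggplot2)

# Compact reconstruction of the data (same counts, different row order).
df <- data.frame(
  gender = rep(c("Male", "Female"), times = c(8, 12)),
  answer = c(rep(c("Yes", "No"), times = c(4, 4)),
             rep(c("Yes", "No"), times = c(5, 7))),
  stringsAsFactors = FALSE
)

p <- ggplot(df, aes(gender, fill = answer)) +
  geom_bar(position = "fill") +                    # each bar is rescaled to height 1
  scale_y_continuous(labels = scales::percent) +   # show 0.5 as 50%
  labs(y = "share of answers within gender")

print(p)
```

Because every bar has the same total height, the different group sizes (12 females, 8 males) no longer affect the comparison. Only the split between "Yes" and "No" inside each bar does.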