0
votes

I have a large dataset and there are many different columns that I am trying to group the data by. I am trying to create a new column using dplyr and mutate which is the mean for each individual group. I then want to see the difference between these means and the mean of just one single category.

This question can pertain to the mtcars dataset. How would I group the mtcars data by "cyl" & "gear" and then take the mean of "mpg" for each group. I then want to see the difference of every group's mean of "mpg" compared to specifically all the cars with "gear"==5, but have variable "cyl".

I apologize if I'm asking the same question as others have, but I have not been able to find this specific question.

df <- mtcars
df2 <- df %>% group_by(cyl, gear) %>% mutate(mean_mpg = mean(mpg))
3
df2 <- df %>% group_by(cyl, gear) %>% summarise(mean_mpg = mean(mpg)) should get you startedJack Brookes
What does "but have variable "cyl" " mean ?Jack Brookes
I want to see the difference of the means for each 4 cylinder vehicles relative to cars with 5 gears & 4 cylinders, difference of means for 6 cyl relative to 5 gears & 6 cyl, etc.Brent B
@BrentB Your comment seems to contradict your question. Your question says "group the mtcars data by "cyl" & "gear" and then take the mean of "mpg" for each group", meaning you have means for, e.g., cars with 4 cyl and 3 gears, 4 cyl and 4 gears, 4 cyl and 5 gears, etc. (Jack Brookes's answer covers this case). But your comment above seems to say you want means for 4, 6, 8 and cyl (ignoring gear), and compare those to means of 4, 6, and 8 cyl where gear is 5. I answered using my interpretation of you comment. Please edit your question to make your goal clear.Gregor Thomas

3 Answers

0
votes
df2 <- df %>%
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  mutate(comparison_mpg = mean_mpg[which(gear == 5)],
         mpg_diff = mean_mpg - comparison_mpg)

Result

# A tibble: 8 x 5
# Groups:   cyl [3]
    cyl  gear mean_mpg comparison_mpg mpg_diff
  <dbl> <dbl>    <dbl>          <dbl>    <dbl>
1    4.    3.     21.5           28.2  -6.70  
2    4.    4.     26.9           28.2  -1.27  
3    4.    5.     28.2           28.2   0.    
4    6.    3.     19.8           19.7   0.0500
5    6.    4.     19.8           19.7   0.0500
6    6.    5.     19.7           19.7   0.    
7    8.    3.     15.0           15.4  -0.350 
8    8.    5.     15.4           15.4   0.    
1
votes

This is pretty brute force but it should give you what you want. I got the mean mpg of both cyl and gear then just of cyl ignoring gear and then the mean mpg of gear ignoring cyl .

mtcars %>%  
  group_by(cyl,gear) %>% 
  mutate(mean_mpg_both = mean(mpg)) %>% 
  ungroup %>% 
  group_by(gear) %>% 
  mutate(mean_gear_mpg = mean(mpg)) %>% 
  ungroup %>% 
  group_by(cyl) %>% 
  mutate(mean_cyl_mpg = mean(mpg)) %>% 
  select(mpg,cyl,gear,mean_mpg_both,mean_gear_mpg, mean_cyl_mpg) %>% 
  group_by(cyl,gear) %>% 
  filter(row_number()==1)
0
votes

Going from your comment, I think this is what you are after:

mtcars %>% group_by(cyl) %>%
    summarize(mean_by_cyl = mean(mpg),
              mean_gear5_by_cyl = mean(mpg[gear == 5]),
              mean_diff_from_gear5 = mean_by_cyl - mean_gear5_by_cyl)
# # A tibble: 3 x 4
#     cyl mean_by_cyl mean_gear5_by_cyl mean_diff_from_gear5
#   <dbl>       <dbl>             <dbl>                <dbl>
# 1     4    26.66364              28.2          -1.53636364
# 2     6    19.74286              19.7           0.04285714
# 3     8    15.10000              15.4          -0.30000000