I have a data frame with one row per gene and cell line. Its columns are the gene, the chromosomal region the gene belongs to, the cell line the expression was measured in, and the gene's expression level in that cell line. It looks something like this:

gene    region    cell_line    expression
A       X         Joe          1
B       X         Joe          2 
C       Y         Joe          2
D       Z         Joe          3
E       Z         Joe          0
A       X         Claire       2
B       X         Claire       1
C       Y         Claire       3
D       Z         Claire       3
E       Z         Claire       1

For each cell line and each chromosomal region, I want to calculate the mean, standard deviation, etc. of the expression of all genes NOT in that region. So for region X of Joe, for example, I want the summarize() output row to show the mean expression of all of Joe's genes outside X (i.e. genes C, D, E of Joe).

So the output looks something like:

region    cell_line     mean_other    standard_deviation_other   
X         Joe           1.67          some number
Y         Joe           1.5           some number
Z         Joe           1.67          some number
X         Claire        2.33          some number
Y         Claire        1.75          some number
Z         Claire        2             some number

My idea is to do something like the following, except that I have no idea how to get summarize() to work on rows outside the group it's "operating on" at a given time.

df %>% group_by(region, cell_line) %>% 
  summarize(mean_other = mean(expression of genes NOT in this region),
            standard_deviation_other = sd(expression of genes NOT in this region))

1 Answer

We can use the new dplyr::group_modify() to apply a function to each group; the function receives each group as a data frame. Inside it, dplyr::anti_join() against the original data frame drops that group's rows, leaving only the rows outside the group, which you can then pass to whatever summarize() call you wanted.

Using mtcars:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  group_modify(~anti_join(mtcars, .) %>%
                 summarize(disp_m = mean(disp),
                           disp_sd = sd(disp)))
#> # A tibble: 3 x 3
#> # Groups:   cyl [3]
#>     cyl disp_m disp_sd
#>   <dbl>  <dbl>   <dbl>
#> 1     4   297.   101. 
#> 2     6   244.   136. 
#> 3     8   136.    50.7

And checking for the first group with cyl == 4:

mtcars %>%
  filter(cyl != 4) %>%
  summarize(disp_m = mean(disp),
            disp_sd = sd(disp))
#>     disp_m  disp_sd
#> 1 296.5048 101.1434

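On your df you also need to restrict the comparison to the same cell line, and use sd() rather than var() for the standard deviation. Note that group_modify() passes the group's rows (without the grouping columns) as .x and the group's keys as .y, so it should look something like this:

df %>%
  group_by(region, cell_line) %>%
  group_modify(~ df %>%
                 filter(cell_line == .y$cell_line) %>%  # same cell line only
                 anti_join(.x, by = "gene") %>%         # drop this region's genes
                 summarize(mean_other = mean(expression),
                           sd_other = sd(expression)))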
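
To check the idea against the sample data from the question, here is a minimal self-contained sketch. It rebuilds the example data frame by hand and uses a helper, others_summary(), which is a hypothetical name introduced only for this illustration. It assumes each gene appears once per cell line and filters on the group keys directly instead of joining.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  gene       = rep(c("A", "B", "C", "D", "E"), times = 2),
  region     = rep(c("X", "X", "Y", "Z", "Z"), times = 2),
  cell_line  = rep(c("Joe", "Claire"), each = 5),
  expression = c(1, 2, 2, 3, 0, 2, 1, 3, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group_modify() calls it with the group's rows and a
# one-row data frame of the group's keys. Only the keys are needed here:
# keep rows from the same cell line that lie outside the group's region.
others_summary <- function(group_rows, key) {
  df %>%
    filter(cell_line == key$cell_line, region != key$region) %>%
    summarize(mean_other = mean(expression),
              sd_other = sd(expression))
}

res <- df %>%
  group_by(region, cell_line) %>%
  group_modify(others_summary) %>%
  ungroup()

print(res)
```

For region X of Joe this averages genes C, D and E of Joe (2, 3 and 0), which gives the 1.67 expected in the question.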