5
votes

Take the following example data:

set.seed(1)

foo <- data.frame(x=rnorm(10, 0, 10), y=rnorm(10, 0, 10), fac = c(rep("A", 5), rep("B", 5)))

I want to split the dataframe "foo" by the variable "fac" into A's and B's, apply a function (mahalanobis distance) that returns a vector of the length of each subgroup, and then mutate the output back on to the original dataframe. For example:

auto.mahalanobis <- function(x) {
  temp <- x[, c("x", "y")]
  return(mahalanobis(temp, center = colMeans(temp, na.rm=T), cov = cov(temp, 
use="pairwise.complete.obs")))
}

foo %>% group_by(fac) %>%
  mutate(mahal = auto.mahalanobis(.))

Which gives an error. Obviously this procedure can be done manually by splitting the dataset, applying the function, and adding the output as a column before putting it all back together again. But there must be a more efficient way to do this (perhaps this is a misuse of dplyr?).

3

3 Answers

2
votes

How about making use of nest instead:

foo %>%
    group_by(fac) %>%
    nest() %>%
    mutate(mahal = map(data, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = T),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
    unnest()
## A tibble: 10 x 4
#   fac   mahal      x       y
#   <fct> <dbl>  <dbl>   <dbl>
# 1 A     1.02   -6.26  15.1
# 2 A     0.120   1.84   3.90
# 3 A     2.81   -8.36  -6.21
# 4 A     2.84   16.0  -22.1
# 5 A     1.21    3.30  11.2
# 6 B     2.15   -8.20  -0.449
# 7 B     2.86    4.87  -0.162
# 8 B     1.23    7.38   9.44
# 9 B     0.675   5.76   8.21
#10 B     1.08   -3.05   5.94

Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], as you nest relevant columns after grouping by fac. Applying mahalanobis is then straight-forward.


Update

To respond to your comment, here is a purrr option. Since it's easy to loose track of what's going on, let's go step-by-step:

  1. Generate sample data with one additional column.

    set.seed(1)
    foo <- data.frame(
        x = rnorm(10, 0, 10),
        y = rnorm(10, 0, 10),
        z = rnorm(10, 0, 10),
        fac = c(rep("A", 5), rep("B", 5)))
    
  2. We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a list

    cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))
    

    So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x+y and then separately for y+z. The names of cols will be used as the column names of the two distance vectors.

  3. Now for the actual purrr command:

    imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
        mutate_all(function(lst) map(lst, ~mahalanobis(
            .x,
            center = colMeans(.x, na.rm = T),
            cov = cov(., use = "pairwise.complete.obs")))) %>%
        unnest() %>%
        bind_cols(foo, .)
    #           x           y           z fac     cols1     cols2
    #1  -6.264538  15.1178117   9.1897737   A 1.0197542 1.3608052
    #2   1.836433   3.8984324   7.8213630   A 0.1199607 1.1141352
    #3  -8.356286  -6.2124058   0.7456498   A 2.8059562 1.5099574
    #4  15.952808 -22.1469989 -19.8935170   A 2.8401953 3.0675228
    #5   3.295078  11.2493092   6.1982575   A 1.2141337 0.9475794
    #6  -8.204684  -0.4493361  -0.5612874   B 2.1517055 1.2284793
    #7   4.874291  -0.1619026  -1.5579551   B 2.8626501 1.1724828
    #8   7.383247   9.4383621 -14.7075238   B 1.2271316 2.5723023
    #9   5.757814   8.2122120  -4.7815006   B 0.6746788 0.6939081
    #10 -3.053884   5.9390132   4.1794156   B 1.0838341 2.3328276
    

    In short, we

    1. loop over entries in cols,
    2. nest data in foo per fac based on columns defined in cols,
    3. apply mahalanobis on the nested and grouped data generating as many distance columns with nested data as we have entries in cols (i.e. subsets), and
    4. finally unnest the distance data and column-bind it to the original foo data.
2
votes

You can simply do -

foo %>% group_by(fac) %>%
  mutate(mahal = auto.mahalanobis(data.frame(x, y)))

# A tibble: 10 x 4
# Groups:   fac [2]
        x       y fac   mahal
    <dbl>   <dbl> <fct> <dbl>
 1 - 6.26  15.1   A     1.02 
 2   1.84   3.90  A     0.120
 3 - 8.36 - 6.21  A     2.81 
 4  16.0  -22.1   A     2.84 
 5   3.30  11.2   A     1.21 
 6 - 8.20 - 0.449 B     2.15 
 7   4.87 - 0.162 B     2.86 
 8   7.38   9.44  B     1.23 
 9   5.76   8.21  B     0.675
10 - 3.05   5.94  B     1.08

You can remove temp <- x[, c("x", "y")] from your function and simply use temp instead of x as function argument.

Cleaned-up function -

auto.mahalanobis <- function(temp) {
  mahalanobis(temp,
              center = colMeans(temp, na.rm=T),
              cov = cov(temp, use="pairwise.complete.obs")
              )
}

Btw, great job on your first post!

0
votes

Alternative: group_modify

There is also the way of using the group_modify function of dplyr what makes the code shorter.

Extension 1: Select specific columns

I have used the Mahalanobis distance to identify multivariate outliers. Here it makes sense to add variable names that should be considered (This is also an answer to @TKraft). You can specify now the columns that should be by cols=c("x","y")

Extension 2: Determine threshold for multivariate outliers

Moreover, the distance alone cannot be used directly to filter out possible outliers, since one needs a threshold. This threshold could be determined by the chi-square distribution, where the degrees of freedom are equal to the number of variables to be considered.

    auto.mahalanobis <- function(temp, cols, chisq.prob=0.95) {
        temp_sel <- temp %>% select(any_of(cols))
        temp$mah <- mahalanobis(temp_sel,
                  center = colMeans(temp_sel, na.rm=T),
                  cov = cov(temp_sel, use="pairwise.complete.obs")
                  )
        threshold <- qchisq(chisq.prob, length(cols))
        temp$mah_chisq <- temp$mah > threshold
      return(temp)}
      
      
    foo %>%
      group_by(fac) %>%
      group_modify(~ auto.mahalanobis(temp=.x, cols=c("x","y"), chisq.prob = .975))
# A tibble: 10 x 5
# Groups:   fac [2]
   fac       x       y   mah mah_chisq
   <chr> <dbl>   <dbl> <dbl> <lgl>    
 1 A     -6.26  15.1   1.02  FALSE    
 2 A      1.84   3.90  0.120 FALSE    
 3 A     -8.36  -6.21  2.81  FALSE    
 4 A     16.0  -22.1   2.84  FALSE    
 5 A      3.30  11.2   1.21  FALSE    
 6 B     -8.20  -0.449 2.15  FALSE    
 7 B      4.87  -0.162 2.86  FALSE    
 8 B      7.38   9.44  1.23  FALSE    
 9 B      5.76   8.21  0.675 FALSE    
10 B     -3.05   5.94  1.08  FALSE