4
votes

I am interested in, for example, replacing (nearly) all of the columns of data.frame or tibble with columns where the row minimum has been subtracted from each row. For example, if X is a numerical matrix, then in base R I would write:

X = sweep(X, 1, apply(X, 1, min))

My current function for doing this using the data I have--I'll explain the format momentarily--pulls out the numerical columns into a matrix, does the sweep, then cbinds the transformed data and the non-numerical data back together again. That is:

subtractMin = function(data){
  X = data %>% 
    select(starts_with("X")) %>% 
    as.matrix()

  X = sweep(X, 1, apply(X, 1, min))

  labels = data %>% 
    select(-starts_with("X"))

  return(cbind(labels, X))
}

This strikes me as inefficient, and there must be a smarter way.

I don't think it is important to know given the context, but my data has 77 rows and 1133 columns. Four of the columns contain label information, and the remaining 1129 contain the numerical measurements for each observation (they're spectra if you care). The number of numerical variables is such that individual mutates are not a way forward. Equally - you still need to know the row minimum to make the standardisation for each row.

I have been asked to add some data. The original data has over 1,000 columns, so I will provide a smaller data set

> x.df
      nm X1799.38928 X1798.01526 X1796.64124 source color rep
1 s001c1   13901.944   13889.056   13883.334     01     c   1
2 s001c2   17293.586   17279.375   17291.365     01     c   2
3 s001c3    8011.764    8028.584    8033.548     01     c   3
4 s001c4    7499.272    7510.719    7517.064     01     c   4
5 s001c5   20300.408   20293.604   20297.185     01     c   5
3
Just a thought, but would it be possible that the data manipulation is awkward because the data is in the wrong format ? data.frames are not ideal for "horizontal" operations though there are a few vectorized functions like pmin that can be leveraged in your case. It would depend of your context but you might be better of having your 1129 columns in a matrix with rownames as current nm, and a 4 column metadata data.frame/tibble on the side. On the matrix you can use sweep, apply with margin=1 etc, because it's what matrices are made for.Moody_Mudskipper

3 Answers

4
votes

(For what it's worth, I think the down-votes are a bit harsh and unwarranted here. The problem statement is clear, and sample data has been included in an edit.)

You can achieve what you're after by converting data in numeric columns from wide to long (using gather), grouping by rows (using group_by), subtracting the minimum (using mutate), and converting back from long to wide (using spread).

library(tidyverse)
df %>%
    gather(k, v, starts_with("X")) %>%
    group_by(nm) %>%
    mutate(v = v - min(v)) %>%
    spread(k, v) %>%
    select(names(df))
## A tibble: 5 x 7
## Groups:   nm [5]
#  nm     X1799.38928 X1798.01526 X1796.64124 source color   rep
#  <fct>        <dbl>       <dbl>       <dbl>  <int> <fct> <int>
#1 s001c1       18.6         5.72        0.        1 c         1
#2 s001c2       14.2         0.         12.0       1 c         2
#3 s001c3        0.         16.8        21.8       1 c         3
#4 s001c4        0.         11.4        17.8       1 c         4
#5 s001c5        6.80        0.          3.58      1 c         5

Sample data

df <- read.table(text =
    "nm X1799.38928 X1798.01526 X1796.64124 source color rep
1 s001c1   13901.944   13889.056   13883.334     01     c   1
2 s001c2   17293.586   17279.375   17291.365     01     c   2
3 s001c3    8011.764    8028.584    8033.548     01     c   3
4 s001c4    7499.272    7510.719    7517.064     01     c   4
5 s001c5   20300.408   20293.604   20297.185     01     c   5")
3
votes

I know you asked for tidyverse / dplyr but in case you overlooked base R, here's a solution:

ind <- !names(df) %in% c("nm","source","color","rep")
df[ind] <- df[ind] - do.call(pmin, df[ind])
df
#       nm X1799.38928 X1798.01526 X1796.64124 source color rep
# 1 s001c1      18.610       5.722       0.000      1     c   1
# 2 s001c2      14.211       0.000      11.990      1     c   2
# 3 s001c3       0.000      16.820      21.784      1     c   3
# 4 s001c4       0.000      11.447      17.792      1     c   4
# 5 s001c5       6.804       0.000       3.581      1     c   5

And I suppose this would qualify as a tidyverse solution (though not very idiomatic):

df %>% 
  split.default(!names(df) %in% c("nm","source","color","rep")) %>%
  map_at("TRUE", ~ .x - invoke(pmin,.x)) %>%
  bind_cols
#       nm source color rep X1799.38928 X1798.01526 X1796.64124
# 1 s001c1      1     c   1      18.610       5.722       0.000
# 2 s001c2      1     c   2      14.211       0.000      11.990
# 3 s001c3      1     c   3       0.000      16.820      21.784
# 4 s001c4      1     c   4       0.000      11.447      17.792
# 5 s001c5      1     c   5       6.804       0.000       3.581
2
votes

We could do this with pmin to get the rowwise minimum value and then use mutate_at to find the difference between the columns and minimum value

library(tidyverse)
ins <- x.df %>%
            select(starts_with("X")) %>% 
            reduce(pmin)
x.df %>% 
      mutate_at(vars(starts_with("X")), funs(. - mins))
#  nm X1799.38928 X1798.01526 X1796.64124 source color rep
#1 s001c1      18.610       5.722       0.000      1     c   1
#2 s001c2      14.211       0.000      11.990      1     c   2
#3 s001c3       0.000      16.820      21.784      1     c   3
#4 s001c4       0.000      11.447      17.792      1     c   4
#5 s001c5       6.804       0.000       3.581      1     c   5

Or combining it in a single chain

x.df %>% 
      mutate(mins = reduce(.[grepl("^X", names(.))], pmin)) %>% # get min by row
      mutate_at(vars(starts_with("X")), funs(. - mins)) %>% # take difference
      select(-mins) # remove the column mins

NOTE: pmin is originally posted in our post