176
votes

I have a dataframe with multiple columns. For each row in the dataframe, I want to call a function on the row, and the input of the function is using multiple columns from that row. For example, let's say I have this data and this testFunc which accepts two args:

> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
  x y z
1 1 3 5
2 2 4 6
> testFunc <- function(a, b) a + b

Let's say I want to apply this testFunc to columns x and z. So, for row 1 I want 1+5, and for row 2 I want 2 + 6. Is there a way to do this without writing a for loop, maybe with the apply function family?

I tried this:

> df[,c('x','z')]
  x z
1 1 5
2 2 6
> lapply(df[,c('x','z')], testFunc)
Error in a + b : 'b' is missing

But I got an error. Any ideas?

EDIT: the actual function I want to call is not a simple sum, but it is power.t.test. I used a+b just for example purposes. The end goal is to be able to do something like this (written in pseudocode):

df = data.frame(
    delta=c(delta_values), 
    power=c(power_values), 
    sig.level=c(sig.level_values)
)

lapply(df, power.t.test(delta_from_each_row_of_df, 
                        power_from_each_row_of_df, 
                        sig.level_from_each_row_of_df
))

where the result is a vector of outputs for power.t.test for each row of df.

12
See also stackoverflow.com/a/24728107/946850 for the dplyr way. (comment by krlmlr)

12 Answers

146
votes

You can apply apply to a subset of the original data.

 dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
 apply(dat[,c('x','z')], 1, function(x) sum(x) )

or if your function is just sum use the vectorized version:

rowSums(dat[,c('x','z')])
[1] 6 8

If you want to use testFunc

 testFunc <- function(a, b) a + b
 apply(dat[,c('x','z')], 1, function(x) testFunc(x[1],x[2]))

EDIT To access columns by name and not index you can do something like this:

 testFunc <- function(a, b) a + b
 apply(dat[,c('x','z')], 1, function(y) testFunc(y['z'],y['x']))
112
votes

A data.frame is a list, so ...

For vectorized functions, do.call is usually a good bet. But the names of arguments come into play. Here your testFunc is called with args x and z in place of a and b. The ... allows irrelevant args (such as y) to be passed without causing an error:

do.call( function(x,z,...) testFunc(x,z), df )

For non-vectorized functions, mapply will work, but you need to match the ordering of the args or explicitly name them:

mapply(testFunc, df$x, df$z)

Sometimes apply will work - as when all args are of the same type so coercing the data.frame to a matrix does not cause problems by changing data types. Your example was of this sort.

If your function is to be called within another function into which the arguments are all passed, there is a much slicker method than these. Study the first lines of the body of lm() if you want to go that route.
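
As a minimal sketch of the name-matching point above (the names args, vec_result, and row_result are illustrative and not part of the original answer), the selected columns can be renamed to match testFunc's parameters before do.call, or the arguments can be named explicitly in mapply:

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
testFunc <- function(a, b) a + b

# do.call passes list elements as named arguments, so rename the
# selected columns to match testFunc's parameters a and b
args <- setNames(df[c("x", "z")], c("a", "b"))
vec_result <- do.call(testFunc, args)   # one vectorized call: 6 8

# mapply calls testFunc once per row; naming the arguments makes
# the result independent of the order in which they are listed
row_result <- mapply(testFunc, b = df$z, a = df$x)   # 6 8
```

Selecting only the needed columns first also removes the need for a ... catch-all in the called function.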

34
votes

Use mapply

> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
  x y z
1 1 3 5
2 2 4 6
> mapply(function(x,y) x+y, df$x, df$z)
[1] 6 8

> cbind(df,f = mapply(function(x,y) x+y, df$x, df$z) )
  x y z f
1 1 3 5 6
2 2 4 6 8
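
If the function also takes an argument that should be the same for every row, mapply's MoreArgs parameter can carry it. The sketch below assumes a hypothetical function scaledSum with a constant multiplier k; neither appears in the original question.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# scaledSum and its argument k are hypothetical, for illustration only
scaledSum <- function(a, b, k) (a + b) * k

# a and b vary by row; k is held fixed for every call via MoreArgs
df$f <- mapply(scaledSum, df$x, df$z, MoreArgs = list(k = 2))   # 12 16
```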
22
votes

New answer with dplyr package

If the function that you want to apply is vectorized, then you could use the mutate function from the dplyr package:

> library(dplyr)
> myf <- function(tens, ones) { 10 * tens + ones }
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mutate(x, value = myf(tens, ones))
  hundreds tens ones value
1        7    1    4    14
2        8    2    5    25
3        9    3    6    36

Old answer with plyr package

In my humble opinion, the tool best suited to the task is mdply from the plyr package.

Example:

> library(plyr)
> x <- data.frame(tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
  tens ones V1
1    1    4 14
2    2    5 25
3    3    6 36

Unfortunately, as Bertjan Broeksema pointed out, this approach fails if you don't use all the columns of the data frame in the mdply call. For example,

> library(plyr)
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
Error in (function (tens, ones)  : unused argument (hundreds = 7)
14
votes

Others have correctly pointed out that mapply is made for this purpose, but (for the sake of completeness) a conceptually simpler method is just to use a for loop.

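df$newvar <- NA_real_  # preallocate the result column
for (row in seq_len(nrow(df))) {  # seq_len() is safe even when df has no rows
    df$newvar[row] <- testFunc(df$x[row], df$z[row]) 
}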
12
votes

Many functions are already vectorized, so there is no need for any iteration (neither for loops nor *apply functions). Your testFunc is one such example. You can simply call:

  testFunc(df[, "x"], df[, "z"])

In general, I would recommend trying such vectorization approaches first and see if they get you your intended results.


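Alternatively, if you need to pass multiple arguments to a function which is not vectorized, mapply might be what you are looking for. Name the arguments so that they match the function's parameters (this assumes df has the delta and power columns from the question's edit):

  mapply(power.t.test, delta = df[, "delta"], power = df[, "power"], SIMPLIFY = FALSE)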
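
When it is unclear whether a function is vectorized, one option is to wrap it with base R's Vectorize(), which builds an mapply-based wrapper. This is only a sketch: scalarFunc below is a hypothetical function invented for illustration, chosen because if() only works on a single condition.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# scalarFunc is hypothetical: it only works on single values,
# because if() needs a condition of length one
scalarFunc <- function(a, b) if (a > 1) a + b else a - b

# Vectorize() returns a wrapper that applies scalarFunc element by element
vecFunc <- Vectorize(scalarFunc)
out <- vecFunc(df$x, df$z)   # -4 8
```

The wrapper still loops internally, so it is a convenience rather than a speedup.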
5
votes

Here is an alternate approach. It is more intuitive.

One key aspect that I feel some of the answers did not take into account, and which I point out for posterity, is that apply() lets you do row calculations easily, but only for matrix-like data where every column has the same type (for example, all numeric).

Operations on columns are still possible for data frames:

as.data.frame(lapply(df, myFunctionForColumn))

To operate on rows, we make the transpose first.

tdf<-as.data.frame(t(df))
as.data.frame(lapply(tdf, myFunctionForRow))

The downside is that, I believe, R will make a copy of your data, which could be a memory issue. (This is truly sad, because it is programmatically simple for tdf to just be an iterator to the original df, thus saving memory, but R does not allow pointer or iterator referencing.)

A related question is how to operate on each individual cell in a data frame:

newdf <- as.data.frame(lapply(df, function(x) sapply(x, myFunctionForEachCell)))
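
A concrete version of the transpose approach might look like the following sketch, where the built-in sum stands in for the placeholder myFunctionForRow and the example data frame is taken from the question:

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# After transposing, each original row becomes a column of tdf
tdf <- as.data.frame(t(df))

# sum stands in for myFunctionForRow; sapply visits one column
# of tdf (one row of df) at a time
row_totals <- sapply(tdf, sum)   # 9 12
```

Note that t() goes through a matrix, so this shares apply()'s limitation when the columns have different types.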
5
votes

data.table has a really intuitive way of doing this as well:

library(data.table)

sample_fxn = function(x,y,z){
    return((x+y)*z)
}

df = data.table(A = 1:5,B=seq(2,10,2),C = 6:10)
> df
   A  B  C
1: 1  2  6
2: 2  4  7
3: 3  6  8
4: 4  8  9
5: 5 10 10

The := operator can be called within brackets to add a new column using a function

df[,new_column := sample_fxn(A,B,C)]
> df
   A  B  C new_column
1: 1  2  6         18
2: 2  4  7         42
3: 3  6  8         72
4: 4  8  9        108
5: 5 10 10        150

It's also easy to pass constants as arguments using this method:

df[,new_column2 := sample_fxn(A,B,2)]

> df
   A  B  C new_column new_column2
1: 1  2  6         18           6
2: 2  4  7         42          12
3: 3  6  8         72          18
4: 4  8  9        108          24
5: 5 10 10        150          30
4
votes

@user20877984's answer is excellent. Since they summed it up far better than my previous answer, here is my (possibly still shoddy) attempt at an application of the concept:

Using do.call in a basic fashion:

powvalues <- list(power=0.9,delta=2)
do.call(power.t.test,powvalues)

Working on a full data set:

# get the example data
df <- data.frame(delta=c(1,1,2,2), power=c(.90,.85,.75,.45))

#> df
#  delta power
#1     1  0.90
#2     1  0.85
#3     2  0.75
#4     2  0.45

lapply the power.t.test function to each of the rows of specified values:

result <- lapply(
  split(df,1:nrow(df)),
  function(x) do.call(power.t.test,x)
)

> str(result)
List of 4
 $ 1:List of 8
  ..$ n          : num 22
  ..$ delta      : num 1
  ..$ sd         : num 1
  ..$ sig.level  : num 0.05
  ..$ power      : num 0.9
  ..$ alternative: chr "two.sided"
  ..$ note       : chr "n is number in *each* group"
  ..$ method     : chr "Two-sample t test power calculation"
  ..- attr(*, "class")= chr "power.htest"
 $ 2:List of 8
  ..$ n          : num 19
  ..$ delta      : num 1
  ..$ sd         : num 1
  ..$ sig.level  : num 0.05
  ..$ power      : num 0.85
... ...
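
To get back to one value per row, a single component can be pulled out of each result. The sketch below assumes the sample size n is the quantity of interest and stores it in a new column; that choice is an illustration, not part of the original answer.

```r
df <- data.frame(delta = c(1, 1, 2, 2), power = c(.90, .85, .75, .45))

# One power.t.test call per row, as in the answer above
result <- lapply(
  split(df, seq_len(nrow(df))),
  function(x) do.call(power.t.test, x)
)

# vapply extracts n from each result and checks that it is a single number
df$n <- vapply(result, function(r) r$n, numeric(1))
```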
4
votes

I came here looking for the tidyverse function name, which I knew existed. Adding this for (my) future reference and for tidyverse enthusiasts: purrrlyr::invoke_rows (purrr::invoke_rows in older versions).

For standard stats methods such as the one in the original question, the broom package would probably help.

2
votes

If data.frame columns are different types, apply() has a problem. A subtlety about row iteration is how apply(a.data.frame, 1, ...) does implicit type conversion to character types when columns are different types; eg. a factor and numeric column. Here's an example, using a factor in one column to modify a numeric column:

mean.height = list(BOY=69.5, GIRL=64.0)

subjects = data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY"))
         , height = c(71.0, 59.3, 62.1, 62.1))

apply(subjects, 1, function(x) x[2] - mean.height[[x[1]]])

The subtraction fails because the columns are converted to character types.

One fix is to back-convert the second column to a number:

apply(subjects, 1, function(x) as.numeric(x[2]) - mean.height[[x[1]]])

But the conversions can be avoided by keeping the columns separate and using mapply():

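mapply(function(x,y) y - mean.height[[as.character(x)]], subjects$gender, subjects$height)

mapply() is needed because [[ ]] does not accept a vector argument. The as.character() call makes the lookup go by level name rather than by the factor's integer code. Alternatively, the lookup for the whole column can be done before the subtraction by passing a vector to [ ], at the cost of slightly uglier code:

subjects$height - unlist(mean.height[as.character(subjects$gender)])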
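
The coercion described above can be checked directly. This is a small sketch using the same subjects data; the names m, coerced, and types are illustrative.

```r
subjects <- data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY")),
                       height = c(71.0, 59.3, 62.1, 62.1))

# apply() first converts the data frame with as.matrix(); mixed column
# types force a character matrix, so height is no longer numeric
m <- as.matrix(subjects)
coerced <- is.character(m)   # TRUE

# Map() walks the columns in parallel and keeps each column's own type
types <- Map(function(g, h) class(h), subjects$gender, subjects$height)
```

Every element of types is "numeric", which is why the mapply() version needs no back-conversion.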
2
votes

A really nice function for this is adply from plyr, especially if you want to append the result to the original dataframe. This function and its cousin ddply have saved me a lot of headaches and lines of code!

df_appended <- adply(df, 1, mutate, sum=x+z)

Alternatively, you can call the function you desire.

df_appended <- adply(df, 1, mutate, sum=testFunc(x,z))