4 votes

I have some data representing repeated (session A and B) measurements (var.x) of patients (id). My data is tidy (i.e. a long-form table with one variable in each column and one observation in each row), like so:

| id | var.1 | var.2 | session |
|----|-------|-------|---------|
|  1 |  1.1  |  11   |    A    |
|  1 |  1.9  |  12   |    B    |
|  2 |  1.2  |  15   |    A    |
|  2 |  1.4  |  14   |    B    |

I want to do a correlation analysis of paired (by id) data (e.g. var.1) from sessions A and B. In other words, what is the correlation coefficient between repeated measurements?

This is not difficult to do if I transpose/pivot/melt/spread the data into a wide format like so:

| id | var.1.A | var.2.A | var.1.B | var.2.B |
|----|---------|---------|---------|---------|
|  1 |   1.1   |   11    |   1.9   |   12    |
|  2 |   1.2   |   15    |   1.4   |   14    |

...then I can simply cor(var.1.A, var.1.B). I realize that dplyr/tidyr have gather/spread/separate/unite commands for that purpose; however, this seems untidy and less than elegant to me, as I basically have to copy my data. (One way to do that reshape is sketched below.)
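
For reference, a minimal sketch of the reshape I mean (one possible gather/unite/spread chain; shown only to illustrate the approach I would rather avoid):

library(dplyr)
library(tidyr)

wide <- data %>%
    gather(var, value, var.1, var.2) %>%   # stack var.1/var.2 into key-value pairs
    unite(var_session, var, session) %>%   # e.g. "var.1_A"
    spread(var_session, value)             # back out to one row per id

cor(wide$var.1_A, wide$var.1_B)
#[1] -1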

Is there a way to do this using dplyr, without reshaping the data?

I'm thinking something along the lines of:

data %>% 
    select(id, var.1, session) %>%
    do( cor( filter(., session=='A'), filter(., session=='B') ) )

...but obviously this doesn't work, and I also need to somehow specify that the data is paired (by id).

Try data %>% select(id, var.1, session) %>% do({data.frame(Cor=cor(.$var.1[.$session=='A'], .$var.1[.$session=='B']))}). If you need to do this by 'id', use group_by after select. – akrun

If you are using data.table: setDT(data)[, Map(function(x,y) cor(x[y=='A'], x[y=='B']), .SD[, c('var.1', 'var.2'), with=FALSE], list(session))] – akrun

Any operation you do will make a temporary copy of your data in memory. Converting your data to a wide format will probably be more efficient and concise than running by-row/column operations. You don't have to store the wide format anywhere; just run some operations until you reach the desired output, and it will be wiped from memory by itself. – David Arenburg

I was under the impression that dplyr generally manages to index data in an efficient way rather than duplicate it. But in any case, duplicating the data can only introduce opportunities to make a mess of the data :-) – Søren ONeill

2 Answers

1 vote

We can try either dplyr or data.table methods. Assuming that we want to find the correlation for the 'var' columns, we create an index of those columns using grep ('nm1') and define a custom function ('f1'). We then use Map within do to get the cor for each 'var' column, passing the 'var' column as the first argument ('x') and 'session' as the second ('y').

nm1 <- grep('^var', names(data))    # indices of the columns starting with 'var'
f1 <- function(x, y) cor(x[y=='A'], x[y=='B'])    # cor of the paired A/B values
library(dplyr)
data %>% 
      do({data.frame(Map(f1, .[nm1], list(.$session)))})
#   var.1 var.2
#1    -1     1

In a similar way, we can get the cor using data.table:

library(data.table)
setDT(data)[, Map(f1, .SD[, nm1, with=FALSE], list(session)) ]
#   var.1 var.2
#1:    -1     1

If we are using base R:

mapply(f1, data[nm1], list(data$session))
# var.1 var.2 
# -1     1 

If we want to convert to wide format, we can use dcast from the devel version of data.table (v1.9.5), as it can take multiple value.var columns.

 dW <- dcast(setDT(data), id~session, value.var=c('var.1', 'var.2'))[, id:=NULL]
 dW
 #   var.1_A var.1_B var.2_A var.2_B
 #1:     1.1     1.9      11      12
 #2:     1.2     1.4      15      14

We can convert to a matrix, subset the alternating A and B columns, apply cor, and take the diag values:

 m1 <- as.matrix(dW)
 # columns alternate A/B, so recycled logical indices pick out the A and B columns
 diag(cor(m1[,c(TRUE, FALSE)], m1[,c(FALSE, TRUE)]))
 #[1] -1  1

data

data <- structure(list(id = c(1L, 1L, 2L, 2L), var.1 = c(1.1, 1.9, 1.2, 
1.4), var.2 = c(11L, 12L, 15L, 14L), session = c("A", "B", "A", 
"B")), .Names = c("id", "var.1", "var.2", "session"), 
class = "data.frame", row.names = c(NA, -4L))
2 votes

I thought I'd add an answer showing how to do this using summarise_each from dplyr.

This approach is dependent on the dataset order. In a case like this I like to use arrange to make sure the dataset is in the desired order, although this can be skipped if you are confident things are already in order.

data %>%
    arrange(id, session) %>%
    summarise_each(funs(cor(.[session == "A"], .[session == "B"])), starts_with("var"))

  var.1 var.2
1    -1     1

In summarise_each, the . refers to the column you are summarizing, whereas in do it refers to the dataset.
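
To illustrate the difference (mean and nrow are just stand-ins here):

library(dplyr)
# In summarise_each, each selected column is passed in as `.`:
data %>% summarise_each(funs(mean(.)), starts_with("var"))
# In do, `.` is the whole data frame (or the current group):
data %>% do(data.frame(n = nrow(.)))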

You can simply write out the variables you want to summarize by name, but I used the special function starts_with to pick the variables. See the help page for select for more on these special functions.
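
For example, naming the columns explicitly should give the same result:

data %>%
    arrange(id, session) %>%
    summarise_each(funs(cor(.[session == "A"], .[session == "B"])), var.1, var.2)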

You also will need to make sure you have a pair within each id. The complete function from package tidyr may be useful if you run into a situation where one of the session levels is missing within an id. If this is the case, you'll likely need to make use of the use argument in cor.
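
A rough sketch of handling that case (complete fills in any missing id/session combinations with NA, and use = "pairwise.complete.obs" tells cor to drop the incomplete pairs):

library(tidyr)
data %>%
    complete(id, session) %>%
    arrange(id, session) %>%
    summarise_each(funs(cor(.[session == "A"], .[session == "B"],
                            use = "pairwise.complete.obs")),
                   starts_with("var"))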