How to manipulate long-table data for correlations, using dplyr, but without transposing data to wide format

Question

I have some data which represents repeated (session A and B) measurements (var.x) of patients (id). My data is tidy (i.e. long-form table with one variable in each column and one observation in each row) ... like so:

| id | var.1 | var.2 | session |
--------------------------------
|  1 |  1.1  |   11  |    A    |
|  1 |  1.9  |   12  |    B    |
|  2 |  1.2  |   15  |    A    |
|  2 |  1.4  |   14  |    B    |

I want to do a correlation analysis of paired (by id) data (e.g. var.1) from session A and B. In other words, what is the correlation coefficient between the repeated measurements?

This is not difficult to do if I transpose/pivot/melt/spread the data into a wide format like so:

| id | var.1.A | var.2.A | var.1.B | var.2.B |
----------------------------------------------
|  1 |  1.1    |   11    |   1.9   |   12    |
|  2 |  1.2    |   15    |   1.4   |   14    |

...then I can simply cor(var.1.A, var.1.B). I realize that dplyr/tidyr have gather/spread/separate/unite commands for that purpose. However, this seems untidy and less than elegant to me, as I basically have to copy my data.

Is there a way to do this using dplyr, without reshaping the data?

I'm thinking something along the lines of:

data %>% 
    select(id, var.1, session) %>%
    do( cor( filter(session=='A', filter(session=='B'))

...but obviously this doesn't work, and I also need to somehow specify that the data is paired (by id).

Try data %>% select(id, var.1, session) %>% do({data.frame(Cor=cor(.$var.1[.$session=='A'], .$var.1[.$session=='B']))}) If you need to do this by 'id', use group_by after select — akrun
If you are using data.table setDT(data)[, Map(function(x,y) cor(x[y=='A'], x[y=='B']), .SD[, c('var.1', 'var.2'), with=FALSE], list(session)) ] — akrun
Any operation you do will make a temporary copy of your data in memory. Converting your data to a wide format will probably be more efficient and concise than running row/column operations. You don't have to store the wide format anywhere; just run some operations until you reach the desired output, and it will be wiped from memory by itself. — David Arenburg
I was under the impression, that dplyr generally manages to index data in an efficient way, rather than duplicate it. But in any case, duplicating the data can only introduce opportunities to make a mess of the data :-) — Søren ONeill

akrun · Accepted Answer · 2015-09-04T07:26:54

We can try either dplyr or data.table methods. Assuming that we want to find the correlation for each of the 'var' columns, we create an index of those columns using grep ('nm1') and a custom function ('f1') that correlates the session 'A' values with the session 'B' values. We use Map within do to get the cor of each 'var' column, passing the 'var' column as the first input, i.e. 'x', and 'session' as the second, i.e. 'y'.

nm1 <- grep('^var', names(data))
f1 <- function(x,y) cor(x[y=='A'], x[y=='B']) 
library(dplyr)
data %>% 
      do({data.frame(Map(f1, .[nm1], list(.$session)))})
#   var.1 var.2
#1    -1     1
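Note that f1 matches the 'A' and 'B' values by position, so the result is only a paired (by id) correlation when the rows of both sessions appear in the same id order. A minimal sketch of one way to guard against that, assuming every id has exactly one 'A' and one 'B' row and that a dplyr version providing across() (1.0.0 or later, which postdates this answer) is installed; the name 'res' is only for illustration:

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(
  id      = c(1L, 1L, 2L, 2L),
  var.1   = c(1.1, 1.9, 1.2, 1.4),
  var.2   = c(11L, 12L, 15L, 14L),
  session = c("A", "B", "A", "B")
)

res <- data %>%
  # Sort so that the 'A' rows and the 'B' rows share the same id order
  arrange(id, session) %>%
  # For every 'var' column, correlate the session A values with the session B values
  summarise(across(starts_with("var"),
                   ~ cor(.x[session == "A"], .x[session == "B"])))
res
```

If some ids lack one of the two sessions, the two subsets no longer line up and this positional approach should not be relied on.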

In a similar way, we can do the cor using data.table.

library(data.table)
setDT(data)[, Map(f1, .SD[, nm1, with=FALSE], list(session)) ]
#   var.1 var.2
#1:    -1     1

If we are using base R:

mapply(f1, data[nm1], list(data$session))
# var.1 var.2 
# -1     1
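The base R call above also relies on row order to pair the sessions. As a hedged sketch, the pairing by id can be made explicit with match(), still without building a wide table. The helper names 'a', 'b', 'vars' and 'res' are hypothetical, and use = "complete.obs" is an added assumption so that ids missing from session 'B' are dropped rather than producing NA:

```r
# Same example data as in the question
data <- data.frame(
  id      = c(1L, 1L, 2L, 2L),
  var.1   = c(1.1, 1.9, 1.2, 1.4),
  var.2   = c(11L, 12L, 15L, 14L),
  session = c("A", "B", "A", "B")
)

# Split the long table by session
a <- data[data$session == "A", ]
b <- data[data$session == "B", ]

# Reorder the session B rows so that row i of 'b' has the same id as row i of 'a'
b <- b[match(a$id, b$id), ]

# Correlate each 'var' column across the two id-aligned subsets
vars <- grep("^var", names(data), value = TRUE)
res <- sapply(vars, function(v) cor(a[[v]], b[[v]], use = "complete.obs"))
res
```

Here 'res' is a named numeric vector with one correlation per 'var' column.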

If we want to convert to wide format, we can use dcast from the devel version of data.table, i.e. v1.9.5, as it can take multiple value.var columns.

 dW <- dcast(setDT(data), id~session, value.var=c('var.1', 'var.2'))[, id:=NULL]
 dW
 #   var.1_A var.1_B var.2_A var.2_B
 #1:     1.1     1.9      11      12
 #2:     1.2     1.4      15      14

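We can convert to a matrix, subset the alternating 'A' and 'B' columns, apply cor, and take the diag values, which are the correlations between each variable's 'A' and 'B' columns.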

 m1 <- as.matrix(dW)
 diag(cor(m1[,c(TRUE, FALSE)], m1[,c(FALSE, TRUE)]))
 #[1] -1  1

data

data <- structure(list(id = c(1L, 1L, 2L, 2L), var.1 = c(1.1, 1.9, 1.2, 
1.4), var.2 = c(11L, 12L, 15L, 14L), session = c("A", "B", "A", 
"B")), .Names = c("id", "var.1", "var.2", "session"), 
class = "data.frame", row.names = c(NA, -4L))
