0
votes

I have two numeric data sets, Df1 and Df2, with 15000 columns each. I want to calculate correlations only between matching columns: column 1 of Df1 with column 1 of Df2, column 2 of Df1 with column 2 of Df2, and so on for all 15000 columns. Creating a full correlation matrix generates a lot of unwanted correlations (every column of Df1 against every column of Df2), so I am looking for a more elegant solution.

Can anyone help me here?

Thanks in advance, H.

2 Answers

1
votes

Please provide reproducible data, for example by extracting a few rows and columns of your data, and show what you have tried. Alternatively, make up data with a similar structure, e.g.:

set.seed(42)
Df1 <- data.frame(matrix(runif(50), 10, 5))
Df2 <- data.frame(matrix(runif(50), 10, 5))

Now use sapply:

idx <- ncol(Df1)
# seq_len() is safe even if idx is 0; correlate column i of Df1 with column i of Df2
result <- sapply(seq_len(idx), function(i) cor(Df1[, i], Df2[, i]))
result
# [1]  0.24864047 -0.40809796  0.03718413 -0.09967868  0.46627380
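
With 15000 columns, calling cor() once per column pair may become slow. A possible vectorized alternative is sketched below. It relies on the fact that the Pearson correlation equals the sum of products of the standardized values divided by n - 1. The names z1, z2, vectorized and looped are made up for illustration, and the sketch assumes there are no missing values and no constant columns (either would give NA/NaN here):

```r
# Toy data with the same structure as above (10 rows, 5 columns)
set.seed(42)
Df1 <- data.frame(matrix(runif(50), 10, 5))
Df2 <- data.frame(matrix(runif(50), 10, 5))

# Standardize every column to mean 0 and sd 1; scale() returns a matrix
z1 <- scale(Df1)
z2 <- scale(Df2)

# Pearson r for column pair i is sum(z1[, i] * z2[, i]) / (n - 1)
vectorized <- colSums(z1 * z2) / (nrow(Df1) - 1)

# Compare with the loop-based approach; should report TRUE
looped <- sapply(seq_len(ncol(Df1)), function(i) cor(Df1[, i], Df2[, i]))
all.equal(unname(vectorized), looped)
```

This only covers the default Pearson method; for Spearman or for pairwise handling of NAs the per-column cor() call is the simpler route.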
0
votes

Another solution based on purrr (using dcarlson's data):

library(purrr)

map2_dbl(
  .x = Df1,
  .y = Df2,
  ~ cor(.x, .y)
)

This returns

#>         X1          X2          X3          X4          X5 
#> 0.24864047 -0.40809796  0.03718413 -0.09967868  0.46627380
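
If you would rather avoid the extra dependency, base R's mapply() can do the same column-by-column pairing. A minimal sketch on the same toy data follows; pairwise and full_diag are just illustrative names. The last lines also show the approach from the question, taking the diagonal of the full cross-correlation matrix, which gives the same values but computes every column pair first (15000 x 15000 entries for the real data):

```r
# Toy data with the same structure as above (10 rows, 5 columns)
set.seed(42)
Df1 <- data.frame(matrix(runif(50), 10, 5))
Df2 <- data.frame(matrix(runif(50), 10, 5))

# mapply() walks over the columns of both data frames in parallel
pairwise <- mapply(cor, Df1, Df2)
pairwise

# Same values via the full matrix, but all 5 x 5 correlations are computed
full_diag <- diag(cor(Df1, Df2))
all.equal(unname(pairwise), unname(full_diag))
```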