1
votes

A matrix cov_mat stores the covariances between variables:

        a_plane a_boat  a_train b_plane b_boat  b_train c_plane c_boat  c_train d_plane …
a_plane   4.419 -0.583    0.446  0.018  -1.291    3.159  -0.954  0.488    3.111   1.100 
a_boat   -0.583  2.636    1.813 -1.511  -0.420   -0.757   1.698  1.668    1.091   0.120 
a_train   0.446  1.813    2.668 -0.365  -0.183    1.040   1.347  1.813    0.806  -0.324 
b_plane   0.018 -1.511   -0.365  2.498   1.153    1.498  -0.465 -1.157   -0.775   0.133 
b_boat   -1.291 -0.420   -0.183  1.153   1.043   -0.194   0.243 -0.361   -0.981  -0.040 
b_train   3.159 -0.757    1.040  1.498  -0.194    4.153  -0.208  0.257    1.922   1.434 
c_plane  -0.954  1.698    1.347 -0.465   0.243   -0.208   1.791  0.909    0.259   0.394 
c_boat    0.488  1.668    1.813 -1.157  -0.361    0.257   0.909  2.290    1.572   0.269 
c_train   3.111  1.091    0.806 -0.775  -0.981    1.922   0.259  1.572    4.097   2.001 
d_plane   1.100  0.120   -0.324  0.133  -0.040    1.434   0.394  0.269    2.001   2.231 
…

final_need is a data frame with a row for each category of transportation (plane, boat, train) and a column for every possible covariance within a given category:

           aa       ab      ac     ad       ba     bb       bc     bd       ca     cb   …   <dd>
plane   4.419    0.018  -0.954  1.100    0.018  2.498   -0.465  0.133   -0.954  -0.465  …   
boat    2.636   -0.420   1.668  0.120   -0.420  1.043   -0.361  …               
train   …                                           
<…> 

To get from cov_mat to final_need, I've converted the file to an edgelist via igraph, then eliminated rows of that edgelist that included out-of-category covariance calculations (e.g.,a_planecovaries witha_boat`, but I could care less). Here's the result:

> head(cov_edgelist_slim)

   from      to covariance
a_plane a_plane      4.419
a_plane b_plane      0.018
a_plane c_plane     -0.954
a_plane d_plane      1.100
b_plane a_plane          …
…       …                …

I then try to use dcast() from reshape2, but am getting stuck on how to use the function to produce the final_need result. Any thoughts? If there's a simpler way than the one I'm heading down, I'm happy to hear it!

2

2 Answers

2
votes

Here is another approach using base R:

  1. Extract the sub-matrix of covariances for each individual vehicle;
  2. Expand this sub-matrix to a vector;
  3. Combine the (row) vectors back into a matrix.

The additional code is to get the right column names for final_need based on pasting together the combinations of the prefixes.

## sort row + colnames in alphabetical order
cov_mat <- cov_mat[sort(rownames(cov_mat)), sort(colnames(cov_mat))]

## unique prefix and vehicle names
prefix <- unique(sub("_\\w+$", "", colnames(cov_mat)))
vehicNames <- unique(sub("^\\w+?_", "", colnames(cov_mat)))

## create final_need
final_need <- t(sapply(vehicNames, function(vehic) {           
          indices <- grep(vehic, colnames(cov_mat))      
          as.vector(cov_mat[indices, indices])
        }))

## add prefix combinations as column names
colnames(final_need) <- levels(interaction(prefix, prefix, sep = ""))

final_need
#>          aa     ba     ca     ab    bb     cb     ac     bc    cc
#> boat  2.636 -0.420  1.668 -0.420 1.043 -0.361  1.668 -0.361 2.290
#> plane 4.419  0.018 -0.954  0.018 2.498 -0.465 -0.954 -0.465 1.791
#> train 2.668  1.040  0.806  1.040 4.153  1.922  0.806  1.922 4.097

EDIT: the same can be done the other way around, i.e. extract the sub-matrix of covariances for each prefix combination and combine their diagonals back into a matrix (by column).

## create final_need by column
final_need <- apply(expand.grid(prefix, prefix), 1, function(i) {
      row_ids <- grep(sprintf("^%s_", i[1]), rownames(cov_mat))
      col_ids <- grep(sprintf("^%s_", i[2]), colnames(cov_mat))
      cov_mat[cbind(row_ids, col_ids)]
    })

## add row + column names
dimnames(final_need) <- list(vehicNames, levels(interaction(prefix, prefix, sep = "")))

final_need
#>          aa     ba     ca     ab    bb     cb     ac     bc    cc
#> boat  2.636 -0.420  1.668 -0.420 1.043 -0.361  1.668 -0.361 2.290
#> plane 4.419  0.018 -0.954  0.018 2.498 -0.465 -0.954 -0.465 1.791
#> train 2.668  1.040  0.806  1.040 4.153  1.922  0.806  1.922 4.097
1
votes

A tidyverse approach could be to bring rownames as column, convert data to long format. Get the first part of string from both column names as well rownames, keep only those rows where both of them match and convert to wide format.

library(tidyverse)

df %>%
  rownames_to_column() %>%
  pivot_longer(cols =-rowname) %>%
  mutate(key = paste0(sub("_.*", "", rowname), sub("_.*", "", name)), 
         rowname = sub(".*_", "", rowname), name = sub(".*_", "", name)) %>%
  filter(rowname == name) %>%
  select(-rowname) %>%
  pivot_wider(names_from = key, values_from = value)

# A tibble: 3 x 17
#  name    aa     ab     ac    ad     ba    bb     bc     bd     ca     cb ....
#  <chr> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#1 plane  4.42  0.018 -0.954   1.1  0.018  2.50 -0.465  0.133 -0.954 -0.465
#2 boat   2.64 -0.42   1.67   NA   -0.42   1.04 -0.361 NA      1.67  -0.361
#3 train  2.67  1.04   0.806  NA    1.04   4.15  1.92  NA      0.806  1.92