0
votes

I have a data frame with various subjects, each of whom contributed at least one tissue sample, i.e. Blood, Heart, Liver, etc, while many of them contributed samples of multiple tissues. There are 31 unique tissues, and I want to create a 31 x 31 matrix indicating tissues pairs collected from a single subject. With row and column names being the names of the tissues, then, the diagonals would give the total number of subjects from whom a tissue sample was collected, and the off diagonals would include the number of subjects who had given both (i.e., if a subject had given a heart and lung sample, the intersection of the heart row/column and lung column/row would increase by 1).

So far, I have been able to get the data (using plyr) into a data frame counts that includes each unique pair found, along with the number of subjects who have contributed both tissue types. When SMTS1 and SMTS2 match, the value in Count indicates the total number of samples of that tissue

> head(counts, n = 32L)
        SMTS1           SMTS2      Count
1  Adipose Tissue  Adipose Tissue   439
2  Adipose Tissue   Adrenal Gland   137
3  Adipose Tissue         Bladder    11
4  Adipose Tissue           Blood   423
5  Adipose Tissue    Blood Vessel   368
6  Adipose Tissue           Brain   146
7  Adipose Tissue          Breast   190
8  Adipose Tissue    Cervix Uteri     8
9  Adipose Tissue           Colon   248
10 Adipose Tissue       Esophagus   341
11 Adipose Tissue  Fallopian Tube     6
12 Adipose Tissue           Heart   266
13 Adipose Tissue          Kidney    33
14 Adipose Tissue           Liver   119
15 Adipose Tissue            Lung   285
16 Adipose Tissue          Muscle   380
17 Adipose Tissue           Nerve   290
18 Adipose Tissue           Ovary    99
19 Adipose Tissue        Pancreas   174
20 Adipose Tissue       Pituitary   102
21 Adipose Tissue        Prostate   105
22 Adipose Tissue  Salivary Gland    64
23 Adipose Tissue            Skin   423
24 Adipose Tissue Small Intestine    97
25 Adipose Tissue          Spleen   110
26 Adipose Tissue         Stomach   182
27 Adipose Tissue          Testis   168
28 Adipose Tissue         Thyroid   290
29 Adipose Tissue          Uterus    81
30 Adipose Tissue          Vagina    86
31  Adrenal Gland  Adipose Tissue   137
32  Adrenal Gland   Adrenal Gland   159
... [823 Additional Rows]

The way this is set up, each of the 31 tissues is present in counts$SMTS1, and counts$SMTS2 contains all of the tissues for which a pair exists. You'll see for Adipose Tissue, there are only 30 entries, indicating that there is one tissue type that is not found with Adipose Tissue.

What I would like to do is make it so that each unique value in SMTS1 is paired with each of the 31 possible tissues. In the case shown, for example, Adipose Tissue only has 30 pairs, indicating that one pair does not exist. In this case, that pair is Bone Marrow. What I would like, then, is for my counts data frame, upon recognizing that, create two additional rows

        SMTS1           SMTS2       Count
1  Adipose Tissue     Bone Marrow     0
2    Bone Marrow     Adipose Tissue   0

giving 0 values indicating that a pair doesn't exist. From there, the, I should have 961 numeric values, which will ultimately end up being the entries for my 31 x 31 matrix.

Here is what I have tried

# Vector of 31 Tissues
tissues <- names(sampleTypes)
names(tissues) <- c("SMTS2")

# Replicate 31 times, one for each unique tissue in SMTS1
rep.tissues <- rep(tissues, 31)

# Make data frame column for merge
rep.df <- as.data.frame(t(rep.tissues)
names(rep.df) <- "SMTS2"

# Merge
match <- merge(counts, rep.df, by = "SMTS2", all.x = TRUE)

However, the output for this is large because of duplicates and, removing those, I'm left with a data frame that is identical to the original counts. Additionally, I realize that this does nothing to fill in the counts$Count value with a 0 for each new row created.

TL;DR I need to create all missing pairwise values and update a third column with a 0 for each row created. These will be used to fill in a 31 x 31 matrix showing which tissues had been collected together from the same subject.

1
tidyr::spread?Richard Telford
@RichardTelford I looked this up, and it appears as if it would take my counts data frame and basically turn it into the matrix that I want, which would be awesome. Do you know how this would handle the missing key pairs?cnolte
Pretty sure tidyr::complete would work here, hard to test without a dput of countsJake Kaupp
@RichardTelford Wow. Ok, that's it, I'm finished. That was incredible. If you put this as an answer, I will gladly accept it. THANK YOU!cnolte

1 Answers

2
votes

You can use tidyr::gather

#Some simulated data

library(tidyverse)# will conflict with plyr
df <- expand.grid(c1 = letters[1:4], c2 = letters[1:4]) %>% 
  mutate(Count = round(runif(16, 1,100))) %>% 
  slice(-c(3, 7, 12))# missing pairs

df %>% spread(key  = c2, value = Count, fill = 0)

# A tibble: 4 x 5
      c1     a     b     c     d
* <fctr> <dbl> <dbl> <dbl> <dbl>
1      a     5    16    18    16
2      b    23    38    58    93
3      c     0     0    81    47
4      d    78    32     0    34

fill argument puts zeros in where there is no data