I have a data frame with various subjects, each of whom contributed at least one tissue sample, i.e. Blood, Heart, Liver, etc, while many of them contributed samples of multiple tissues. There are 31 unique tissues, and I want to create a 31 x 31 matrix indicating tissues pairs collected from a single subject. With row and column names being the names of the tissues, then, the diagonals would give the total number of subjects from whom a tissue sample was collected, and the off diagonals would include the number of subjects who had given both (i.e., if a subject had given a heart and lung sample, the intersection of the heart row/column and lung column/row would increase by 1).
So far, I have been able to get the data (using plyr
) into a data frame counts
that includes each unique pair found, along with the number of subjects who have contributed both tissue types. When SMTS1
and SMTS2
match, the value in Count
indicates the total number of samples of that tissue
> head(counts, n = 32L)
SMTS1 SMTS2 Count
1 Adipose Tissue Adipose Tissue 439
2 Adipose Tissue Adrenal Gland 137
3 Adipose Tissue Bladder 11
4 Adipose Tissue Blood 423
5 Adipose Tissue Blood Vessel 368
6 Adipose Tissue Brain 146
7 Adipose Tissue Breast 190
8 Adipose Tissue Cervix Uteri 8
9 Adipose Tissue Colon 248
10 Adipose Tissue Esophagus 341
11 Adipose Tissue Fallopian Tube 6
12 Adipose Tissue Heart 266
13 Adipose Tissue Kidney 33
14 Adipose Tissue Liver 119
15 Adipose Tissue Lung 285
16 Adipose Tissue Muscle 380
17 Adipose Tissue Nerve 290
18 Adipose Tissue Ovary 99
19 Adipose Tissue Pancreas 174
20 Adipose Tissue Pituitary 102
21 Adipose Tissue Prostate 105
22 Adipose Tissue Salivary Gland 64
23 Adipose Tissue Skin 423
24 Adipose Tissue Small Intestine 97
25 Adipose Tissue Spleen 110
26 Adipose Tissue Stomach 182
27 Adipose Tissue Testis 168
28 Adipose Tissue Thyroid 290
29 Adipose Tissue Uterus 81
30 Adipose Tissue Vagina 86
31 Adrenal Gland Adipose Tissue 137
32 Adrenal Gland Adrenal Gland 159
... [823 Additional Rows]
The way this is set up, each of the 31 tissues is present in counts$SMTS1
, and counts$SMTS2
contains all of the tissues for which a pair exists. You'll see for Adipose Tissue, there are only 30 entries, indicating that there is one tissue type that is not found with Adipose Tissue.
What I would like to do is make it so that each unique value in SMTS1
is paired with each of the 31 possible tissues. In the case shown, for example, Adipose Tissue only has 30 pairs, indicating that one pair does not exist. In this case, that pair is Bone Marrow. What I would like, then, is for my counts
data frame, upon recognizing that, create two additional rows
SMTS1 SMTS2 Count
1 Adipose Tissue Bone Marrow 0
2 Bone Marrow Adipose Tissue 0
giving 0
values indicating that a pair doesn't exist. From there, the, I should have 961 numeric values, which will ultimately end up being the entries for my 31 x 31 matrix.
Here is what I have tried
# Vector of 31 Tissues
tissues <- names(sampleTypes)
names(tissues) <- c("SMTS2")
# Replicate 31 times, one for each unique tissue in SMTS1
rep.tissues <- rep(tissues, 31)
# Make data frame column for merge
rep.df <- as.data.frame(t(rep.tissues)
names(rep.df) <- "SMTS2"
# Merge
match <- merge(counts, rep.df, by = "SMTS2", all.x = TRUE)
However, the output for this is large because of duplicates and, removing those, I'm left with a data frame that is identical to the original counts
. Additionally, I realize that this does nothing to fill in the counts$Count
value with a 0
for each new row created.
TL;DR I need to create all missing pairwise values and update a third column with a 0
for each row created. These will be used to fill in a 31 x 31 matrix showing which tissues had been collected together from the same subject.
tidyr::spread
? – Richard Telfordcounts
data frame and basically turn it into the matrix that I want, which would be awesome. Do you know how this would handle the missing key pairs? – cnoltetidyr::complete
would work here, hard to test without adput
of counts – Jake Kaupp