Creating a matrix of co-occurence from many variables, and plotting it

Question

I have a data set of individuals with a number of health conditions. Individuals either do (1) or do not (0) have each condition (my real data set has 14). What I want to do is summarise the data so I know how often pairs of conditions occur. Note that some individuals may have three or four of the conditions, but what I'm interested in is the pairwise co-occurence. I would then like to plot this as a heatmap.

I suspect that the solution involves the 'gather' function from tidyr, but I haven't been able to work it out. This is an example of what my input looks like and what I'd like to achieve:

Here's some data on individuals and whether or not they have conditions "a", "b" or "c":

library(tidyverse)
library(viridis)

dat <- tibble(
  id = c(1:15),
  a = c(1,0,0,0,1,1,1,0,1,0,0,0,1,0,1),
  b = c(1,0,0,1,1,1,0,0,1,0,0,1,1,0,1),
  c = c(0,0,1,1,0,1,0,1,0,1,1,0,1,1,0))

I want to summarise how often each of the conditions occur, and how often they co-occur. In this case, it's evident that conditions "a" and "b" co-occur more often than do either of these with "c", which usually occurs on its own. Below is my imagined idea of what the data will look like in a plottable format. The first column is 'variable 1', the second is 'variable 2', and the third, is the count of how often these occur together. Below that is the plot which I have in my mind.

plotdat <- tibble(
  var1 = c("a", "a", "a", "b", "b", "c"),
  var2 = c("a", "b", "c", "b", "c", "c"),
  count = c(7, 6, 2, 8, 3, 8))

ggplot(plotdat) +
  geom_tile(aes(var1, var2, fill = count)) +
  scale_fill_viridis()

Perhaps this is not the right approach at all and I actually need to convert the data into a 3x3 matrix. Any possible solutions would be gratefully received!

markus markus · Accepted Answer · 2019-02-08T10:44:58

Here is a way

library(tidyverse)
as.matrix(dat[-1]) %>% 
  crossprod() %>% 
  `[<-`(upper.tri(.), NA) %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  gather(key, value, -rowname) %>%
  filter(!is.na(value))
#  rowname key value
#1       a   a     7
#2       b   a     6
#3       c   a     2
#4       b   b     8
#5       c   b     3
#6       c   c     8

The most important piece is crossprod, I think. But let's go through it step by step.

You don't need column id so we exclude it and convert dat[-1] to a matrix because this is what crossprod expects.

as.matrix(dat[-1]) %>% 
  crossprod()
#  a b c
#a 7 6 2
#b 6 8 3
#c 2 3 8

Then we replace the upper triangle of this matrix with NA because you don't want to compare a-b and b-a etc.

The next step is to convert to a dataframe, make the rownames a column and reshape from wide to long

as.matrix(dat[-1]) %>% 
  crossprod() %>% 
  `[<-`(upper.tri(.), NA) %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  gather(key, value, -rowname)
#  rowname key value
#1       a   a     7
#2       b   a     6
#3       c   a     2
#4       a   b    NA
#5       b   b     8
#6       c   b     3
#7       a   c    NA
#8       b   c    NA
#9       c   c     8

Finally remove NAs to get desired output.

Creating a matrix of co-occurence from many variables, and plotting it

1 Answers