13
votes

I've got an input file with a list of ~50000 clusters and presence of a number of factors in each of them (~10 million entries in total), see a smaller example below:

set.seed(1)
x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="")
y = c(
  paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""),
  paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""),
  paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""),
  paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="")
)
data = data.frame(cluster=x,factor=y)

With a bit of help from another question, I got it to produce a piechart for co-occurrence of factors like this:

counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse='+'))))
pie(counts[counts>1])

But now I would like to have a venn diagram for the co-occurrence of factors. Ideally, also in a way that can take a threshold for the minimum count for each factor. For example, a venn diagram for the different factors so that each one of them has to be present n>10 in each cluster to be taken into account.

I've tried to find a way to produce the table counts with aggregate, but couldn't make it work.

1
Have you looked at any of the R packages for Venn diagrams? See this recent example by G. Jay Kerns using the venneuler library, or this brief article in the Journal of Stat Software using the venn library (Murdoch, 2004). If this is purely about R programming it should be migrated to SO.Andy W
Avilella, this question might not be getting any answers because it's marginally off topic. You might do better on SO, which has an active R user community. But please don't cross-post: just flag the question for moderator attention if you would like it migrated.whuber
I flagged it, but I can't see it being moved to SO yet...719016
Hmmm... No flag was raised. I came here only because I remembered to. Anyway, here we go. I'm confident you'll get some good replies on SO.whuber
@avilella -- Do the solutions below fit the bill? If you have another Venn diagram package in mind, and can't wrestle the data into the appropriate form, please let me know. Thanks.Josh O'Brien

1 Answers

22
votes

I've provided two solutions, using two different packages with Venn diagram capabilities. As you expected, both involve an initial step using the aggregate() function.

I tend to prefer the results from the venneuler package. It's default label positions aren't ideal, but you could adjust them by having a look at the associated plot method (possibly using locator() to select the coordinates).

Solution the 1st:

One possibility is to use venneuler() in the venneuler package to draw your Venn diagram.

library(venneuler)

## Modify the "factor" column, by renaming it and converting
## it to a character vector.
levels(data$factor) <- c("a", "b", "c")
data$factor <- as.character(data$factor)

## FUN is an anonymous function that determines which letters are present
## 2 or more times in the cluster and then pastes them together into 
## strings of a form that venneuler() expects.
##
inter <- aggregate(factor ~ cluster, data=data,
                   FUN = function(X) {
                       tab <- table(X)
                       names <- names(tab[tab>=2])
                       paste(sort(names), collapse="&")
                   })            
## Count how many clusters contain each combination of letters
counts <- table(inter$factor)
counts <- counts[names(counts)!=""]  # To remove groups with <2 of any letter
#  a   a&b a&b&c   a&c     b   b&c     c 
# 19    13    12    14    13     9    12 

## Convert to proportions for venneuler()
ps <- counts/sum(counts)

## Calculate the Venn diagram
vd <- venneuler(c(a=ps[["a"]], b = ps[["b"]], c = ps[["c"]],
                  "a&b" = ps[["a&b"]],
                  "a&c" = ps[["a&c"]],
                  "b&c" = ps[["b&c"]],
                  "a&b&c" = ps[["a&b&c"]]))
## Plot it!
plot(vd)

A few notes about choices I made in writing this code:

  • I've changed the names of factors from "factor-a" to "a". You can obviously change that back.

  • I've only required each factor to be present >=2 times (instead of >10) to be counted within each cluster. (That was to demonstrate the code with this small subset of your data.)

  • If you take a look at the intermediate object counts, you'll see that it contains an initial unnamed element. That element is the number of clusters that contain fewer than 2 of any letter. You can decide better than I whether or not you want to include those in the calculation of the subsequent ps ('proportions') object.

enter image description here

Solution the 2nd:

Another possibility is to employ vennCounts() and vennDiagram() in the Bioconductor package limma. To download the package, follow the instructions here. Unlike the venneuler solution above, the overlap in the resultant diagram is not proportional to the actual degree of intersection. Instead, it annotates the diagram with the actual frequencies. (Note that this solution does not involve any edits to the data$factor column.)

library(limma)

out <- aggregate(factor ~ cluster, data=data, FUN=table)
out <- cbind(out[1], data.frame(out[2][[1]]))

counts <- vennCounts(out[, -1] >= 2)
vennDiagram(counts, names = c("Factor A", "Factor B", "Factor C"),
            cex = 1, counts.col = "red")

enter image description here