Creating graph with iGraph and computing similarity metric

Question

I am trying to create a graph in igraph using a csv file that looks like this:

ID    Element1 Element2 Element3 Element4
12346  A        12       56       2
13007  Y        16       66       2
...    ...      ...      ...      ...

The ID column is populated by unique numeric identifiers, whereas the element columns are populated by numbers (or letters, in Element1) that can repeat across rows. My goal is to compute the pairwise Jaccard similarity of all the IDs, based on the elements shared between the ID nodes. The output should be an N x N matrix.

I have been trying to create the graph in igraph using the graph_from_data_frame function, but this creates nodes from the first two columns and stores the remaining columns as edge attributes on the relationships between those nodes. Any ideas on the best way to create a graph that will allow me to compute the Jaccard similarity between the ID nodes?

For reference, the goal is to use this feature of igraph:

 similarity(graph, vids = V(graph), mode = c("all", "out", "in", "total"),
 loops = FALSE, method = c("jaccard", "dice", "invlogweighted"))

where graph is the graph I create and vids are only the ID nodes.

I don't understand what the desired graph is supposed to look like. Can you be more specific? It's easier to help you if you include a proper reproducible example with sample input and desired output (avoid ... to be precise -- just a simple example) — MrFlick
On other platforms I have created ID nodes with all the assigned metadata elements to it, and also nodes for each metadata element type. For example, the ID node for 12346 contained its Element1-Element4 as attributes and had a relationship to its assigned Element nodes. Not sure if this is what I should create with igraph, or if I should only have ID nodes that are connected to each other if they share a type of Element. — ladybug_4
Do you need the graph for something else, or do you just need to compute the Jaccard similarity in between your IDs? — Lamia
I don't need to. Is there a way to do this without using igraph's similarity function? — ladybug_4

Lamia · Accepted Answer · 2018-12-18T21:58:55

If your primary goal is to compute the Jaccard similarity between your IDs, you can do it without creating a graph first. The Jaccard similarity is the size of the intersection divided by the size of the union, J(A,B) = |A∩B| / |A∪B|, which is equivalent to |A∩B| / ( |A| + |B| - |A∩B| ). Treating each ID's element values as a set, you can calculate it as follows:

## Example dataset
df = read.table(text = "ID    Element1 Element2 Element3 Element4
  12346  A        12       56       2
  13007  Y        16       66       2
  14008  B        14       56       3
  15078  A        15       56       4
  16000  Y        20       66       3"
, header=TRUE, stringsAsFactors=FALSE)

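n = nrow(df)
k = ncol(df) - 1  # number of element columns, so |A| = |B| = k for every row
m = matrix(0,n,n,dimnames=list(df[,1],df[,1]))
for(i in 1:n){
    for(j in i:n){
        ## Number of values shared by rows i and j (compared as strings, regardless of column)
        s = length(intersect(unlist(df[i,-1]),unlist(df[j,-1])))
        ## intersection / (|A| + |B| - intersection); assumes each row holds k distinct values
        m[i,j] = s/(2*k-s)}}
## Making the matrix symmetrical
m[lower.tri(m)] = t(m)[lower.tri(m)]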
> m
          12346     13007     14008     15078     16000
12346 1.0000000 0.1428571 0.1428571 0.3333333 0.0000000
13007 0.1428571 1.0000000 0.0000000 0.0000000 0.3333333
14008 0.1428571 0.0000000 1.0000000 0.1428571 0.1428571
15078 0.3333333 0.0000000 0.1428571 1.0000000 0.0000000
16000 0.0000000 0.3333333 0.1428571 0.0000000 1.0000000
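
The |A| + |B| shortcut above relies on every row contributing the same number of distinct values. Here is a minimal sketch that drops that assumption by computing the union directly. The names sets, jaccard and m2 are made up for this illustration, and the example data frame is rebuilt so the snippet runs on its own:

```r
## Example dataset (same rows as above)
df = read.table(text = "ID Element1 Element2 Element3 Element4
12346 A 12 56 2
13007 Y 16 66 2
14008 B 14 56 3
15078 A 15 56 4
16000 Y 20 66 3", header = TRUE, stringsAsFactors = FALSE)

## One character vector of distinct element values per ID
sets = lapply(seq_len(nrow(df)), function(i) unique(as.character(unlist(df[i, -1]))))
names(sets) = df$ID

## Hypothetical helper: |A intersect B| / |A union B|
jaccard = function(a, b) length(intersect(a, b)) / length(union(a, b))

## N x N matrix; rows and columns are named by ID
m2 = sapply(sets, function(a) sapply(sets, function(b) jaccard(a, b)))
```

This still compares values regardless of which column they came from, the same as the loop version. If rows can have missing or repeated values, the union-based form keeps the denominator correct.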

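If you do end up needing the graph, one option (a sketch, not the only way to model this) is a bipartite layout: one node per ID, one node per column/value pair, and an edge wherever an ID has that value. igraph's similarity() with method = "jaccard" then compares the neighbour sets of the ID nodes. Prefixing each value with its column name is my assumption; it means a 2 in Element4 never matches a 2 in another column, unlike the loop above, which pools values across columns. In this sample data no value appears in two different columns, so the pairwise results should match.

```r
library(igraph)

df = read.table(text = "ID Element1 Element2 Element3 Element4
12346 A 12 56 2
13007 Y 16 66 2
14008 B 14 56 3
15078 A 15 56 4
16000 Y 20 66 3", header = TRUE, stringsAsFactors = FALSE)

ids = as.character(df$ID)

## Long edge list: one row per (ID, "column:value") pair
edges = do.call(rbind, lapply(names(df)[-1], function(col) {
    data.frame(from = ids,
               to = paste(col, df[[col]], sep = ":"),
               stringsAsFactors = FALSE)
}))

## Undirected graph; vertex names come from the two edge-list columns
g = graph_from_data_frame(edges, directed = FALSE)

## Restrict the similarity computation to the ID nodes
id_idx = match(ids, V(g)$name)
sim = similarity(g, vids = id_idx, mode = "all", loops = FALSE, method = "jaccard")
dimnames(sim) = list(ids, ids)
```

The element nodes only exist to carry the shared-value relationships, so they are left out of vids and the result is an N x N matrix over the IDs.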