Finding matches between start nodes for common sources in neo4j

Question

As part of some analysis, I am trying to find targets that have more than 80% common origins for one-hop paths.

The data is of the kind: all nodes are systems, and the only relationship that is relevant is ConnectsTo.

So, I can write queries like

match (n:system)-[r:ConnectsTo]->(m:system) return n,m

to get the sources n for system m.

I am looking to find all systems m that have 80% or more common source systems.

Please advise how this could be done for all systems. I tried with collect but am afraid I couldn't write the proper iteration.

The title of the question says "common targets". Did you mean "common sources"? — Gabor Szarnyas

Gabor Szarnyas · Accepted Answer · 2016-11-06T23:30:07

Let's start by creating a simple example data set:

CREATE
  (s1:System {name:"s1"}), 
  (s2:System {name:"s2"}), 
  (s3:System {name:"s3"}), 
  (s4:System {name:"s4"}), 
  (s5:System {name:"s5"}), 
  (s1)-[:ConnectsTo]->(s3),
  (s1)-[:ConnectsTo]->(s4),
  (s2)-[:ConnectsTo]->(s3),
  (s2)-[:ConnectsTo]->(s4),
  (s2)-[:ConnectsTo]->(s5)

This results in the following graph: s1 connects to s3 and s4, and s2 connects to s3, s4 and s5.

We start from node pairs (m1 and m2) that have at least a single common source. We calculate:

the number of sources for each node (sources1Count and sources2Count)
the number of common sources (commonSources)

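Then we compare the number of common sources to the number of sources of each node. This may need some fine-tuning, depending on what you consider "80% common". Dividing by the floating-point literal 0.8 keeps the comparison in floating point. If you instead compute the ratio commonSources / sources1Count directly, the toFloat function is required, because dividing two integers in Cypher yields an integer.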

The query:

MATCH (m1)<-[:ConnectsTo]-()-[:ConnectsTo]->(m2)
MATCH
  (n1)-[:ConnectsTo]->(m1),
  (n2)-[:ConnectsTo]->(m2)
WITH m1, m2, COUNT(DISTINCT n1) AS sources1Count, COUNT(DISTINCT n2) AS sources2Count
MATCH (m1)<-[:ConnectsTo]-(n)-[:ConnectsTo]->(m2)
WITH m1, m2, sources1Count, sources2Count, COUNT(n) AS commonSources
WHERE
  // we only need each m1-m2 pair once
  ID(m1) < ID(m2) AND
  // similarity
  commonSources / 0.8 >= sources1Count AND
  commonSources / 0.8 >= sources2Count
RETURN m1, m2
ORDER BY m1.name, m2.name

This gives the following results.

╒══════════╤══════════╕
│m1        │m2        │
╞══════════╪══════════╡
│{name: s3}│{name: s4}│
└──────────┴──────────┘
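
To sanity-check this result outside Neo4j, the same comparison can be sketched in plain Python on the sample edges. This is only an illustration of the arithmetic, not part of the Cypher solution; the helper name similar_targets and the constant names are made up for this example.

```python
from itertools import combinations

# Edges of the sample graph, copied from the CREATE statement: (source, target).
EDGES = [
    ("s1", "s3"), ("s1", "s4"),
    ("s2", "s3"), ("s2", "s4"), ("s2", "s5"),
]
THRESHOLD = 0.8  # "80% common"; same constant as in the query


def similar_targets(edges, threshold=THRESHOLD):
    """Return target pairs whose common sources cover at least `threshold` of each target's sources."""
    # Map every target to the set of its sources.
    sources = {}
    for source, target in edges:
        sources.setdefault(target, set()).add(source)

    pairs = []
    # combinations() yields each unordered pair once, like ID(m1) < ID(m2) in the query.
    for m1, m2 in combinations(sorted(sources), 2):
        common = len(sources[m1] & sources[m2])
        if common == 0:
            continue  # the query only considers pairs with at least one common source
        if common / threshold >= len(sources[m1]) and common / threshold >= len(sources[m2]):
            pairs.append((m1, m2))
    return pairs


print(similar_targets(EDGES))  # [('s3', 's4')]
```

Here s3 and s4 share both of their sources (2 / 0.8 = 2.5 >= 2), while s5 shares only one source with s3 and with s4 (1 / 0.8 = 1.25 < 2), so only the pair s3-s4 is reported.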

PS. To check the similarity, you could also use a chained comparison such as:

sources1Count <= toInteger(commonSources / 0.8) >= sources2Count

This avoids the duplication of 0.8 but is harder to read. (toInteger is the current name of the older toInt function.)
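
A quick way to convince yourself that the chained form selects the same pairs as the two separate comparisons is to try both on small integer counts. The Python sketch below assumes that the integer conversion truncates toward zero, as Python's int() does; the function names are hypothetical.

```python
THRESHOLD = 0.8


def two_comparisons(common, count1, count2):
    # Form used in the main query: two separate comparisons.
    return common / THRESHOLD >= count1 and common / THRESHOLD >= count2


def chained(common, count1, count2):
    # Chained form: count1 <= x AND x >= count2, with x truncated to an integer.
    return count1 <= int(common / THRESHOLD) >= count2


# Compare both forms for every combination of small non-negative counts.
mismatches = [
    (common, c1, c2)
    for common in range(0, 50)
    for c1 in range(0, 70)
    for c2 in range(0, 70)
    if two_comparisons(common, c1, c2) != chained(common, c1, c2)
]
print(mismatches)  # []
```

Because the source counts are integers, truncating the left-hand value does not change the outcome of either comparison.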

Update: an idea from InverseFalcon in the comments: use SIZE instead of MATCH and COUNT

MATCH (m1)<-[:ConnectsTo]-()-[:ConnectsTo]->(m2)
// The MATCH above returns one row per common source, so DISTINCT is needed here:
// without it the next MATCH runs once per duplicate row and COUNT(n) is inflated.
WITH DISTINCT m1, m2, SIZE(()-[:ConnectsTo]->(m1)) AS sources1Count, SIZE(()-[:ConnectsTo]->(m2)) AS sources2Count
MATCH (m1)<-[:ConnectsTo]-(n)-[:ConnectsTo]->(m2)
WITH m1, m2, sources1Count, sources2Count, COUNT(n) AS commonSources
WHERE
  // we only need each m1-m2 pair once
  ID(m1) < ID(m2) AND
  // similarity
  commonSources / 0.8 >= sources1Count AND
  commonSources / 0.8 >= sources2Count
RETURN m1, m2
ORDER BY m1.name, m2.name
