Efficiently Find Unique Intermediate Nodes Connecting Two Sets of Nodes in Neo4j/Cypher

Question

I have data on movies that can be either comedies or dramas. I have actors in those movies, who can have multiple roles per movie. I want to find all distinct sets of movies and actors where:

(drama1:Movie {Genre:'Drama'})-[role1]-(actor1:Actor)-[role2]-(comedy:Movie {Genre:'Comedy'})-[role3]-(actor2:Actor)-[role4]-(drama2:Movie {Genre:'Drama'})

That is, I want to find where two (different) dramas are connected by a comedy with which both dramas share at least one actor. I'm struggling to do this efficiently and to get neo4j to give me distinct groups of drama1,drama2,actor1,actor2,comedy. My data is on the order of a few million nodes and tens of millions of relationships, so efficiency is important. A toy setup, which can be plugged into the neo4j online console is:

create (a:Movie {Genre:'Comedy'}), (b:Movie {Genre:'Comedy'}), (c:Movie {Genre:'Comedy'}), (d:Movie {Genre:'Comedy'}), (f:Movie {Genre:'Drama'}), (h:Movie {Genre:'Drama'}),(i:Actor {Name:'Sarah'}),(j:Actor {Name: 'Maria'}),(k:Actor {Name:'Mike'}),(l:Actor {Name:'Jane'}),(m:Actor {Name:'Sam'}),(q:Actor {Name:'Matt'}),(r:Actor {Name:'Tom'}), (i)-[:ActedIn]->(a), (i)-[:ActedIn]->(a) , (i)-[:ActedIn]->(a), (i)-[:ActedIn]->(a) , (i)-[:ActedIn]->(f) , (j)-[:ActedIn]->(b) , (j)-[:ActedIn]->(h) , (j)-[:ActedIn]->(h) , (q)-[:ActedIn]->(c) , (q)-[:ActedIn]->(b) , (q)-[:ActedIn]->(a) , (r)-[:ActedIn]->(f) , (r)-[:ActedIn]->(f) , (r)-[:ActedIn]->(a) , (j)-[:ActedIn]->(b) , (j)-[:ActedIn]->(c) , (k)-[:ActedIn]->(d), (l)-[:ActedIn]->(c) , (i)-[:ActedIn]->(a) , (i)-[:ActedIn]->(h) , (m)-[:ActedIn]->(h)

I've mostly tried variations of

match (drama1:Movie {Genre:'Drama'})-[role1]-(actor1:Actor)-[role2]-(comedy:Movie {Genre:'Comedy'})-[role3]-(actor2:Actor)-[role4]-(drama2:Movie {Genre:'Drama'}) return drama1,actor1,comedy,actor2,drama2

dgrana dgrana · Accepted Answer · 2021-05-06T16:27:34

You can use a multi match query like this:

MATCH p = (drama1:Movie {Genre:'Drama'})-[r1]-(a1)-[r2]-(comedy:Movie {Genre:'Comedy'})-[r3]-(a2)-[r4]-(drama2:Movie {Genre:'Drama'})
MATCH (drama1)-[]-(a:Actor)-[]-(drama2) 
WHERE drama1<>drama2 
RETURN nodes(p)

Query explanation:

The first MATCH find 2 dramas connected by a comedy
The second MATCH tells that the 2 dramas have to be connected by the same actor
The WHERE clause indicate that the 2 dramas have to be different
RETURN clause returns the pattern

There is no much to do to optimize the query because you are not filtering by any property, so you are going through the entire graph. You can try creating an index on 'Genre' property but it's not recomended if you only have two different values.

An other option is using SKIP/LIMIT to retrive patterns in smaller bulks:

MATCH p = (drama1:Movie {Genre:'Drama'})-[r1]-(a1)-[r2]-(comedy:Movie {Genre:'Comedy'})-[r3]-(a2)-[r4]-(drama2:Movie {Genre:'Drama'})
MATCH (drama1)-[]-(a:Actor)-[]-(drama2) 
WHERE drama1<>drama2 
RETURN nodes(p) SKIP 0 LIMIT 100000

This query gets the first 100k patterns and then you have to increment the skip to get the next 100k (SKIP 100000 LIMIT 100000). By this way you have to run 10 queries to get all patterns.

Efficiently Find Unique Intermediate Nodes Connecting Two Sets of Nodes in Neo4j/Cypher

1 Answers