I am currently trying to merge three datasets for analysis purposes. I am using certain common fields to establish the connections between the datasets. In order to create the connections I have tried using the following type of query:
MATCH (p1:Person),(p2:Person)
WHERE p1.email = p2.email AND p1.name = p2.name AND p1 <> p2
CREATE UNIQUE (p1)-[:IS]-(p2);
This can equivalently be written as:
MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2
CREATE UNIQUE (p1)-[:IS]-(p2);
Needless to say, this is a very slow query on a database with about 100,000 Person nodes, especially given that Neo4j does not parallelize a single query.
Now, my question is whether there is a better way to run such queries in Neo4j. I have at least eight CPU cores to dedicate to Neo4j, provided that separate threads do not block one another on locks over shared resources.
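One approach that may help (a sketch, assuming a Neo4j version with schema indexes, i.e. 2.x or later; whether the planner actually uses the index for a property-to-property join varies by version, which PROFILE will reveal) is to create an index on the property being joined and use MERGE instead of CREATE UNIQUE, so each candidate p2 can be found by index lookup rather than by scanning every Person node for every p1:

```cypher
// Index Person.email so the second node can be located by lookup
// instead of a full label scan per p1.
CREATE INDEX ON :Person(email);

// Anchor on p1, find candidate p2 nodes by email, then filter by name.
// MERGE on an undirected pattern avoids creating duplicate relationships.
MATCH (p1:Person)
MATCH (p2:Person {email: p1.email})
WHERE p2.name = p1.name AND p1 <> p2
MERGE (p1)-[:IS]-(p2);
```

Filtering on email first (rather than name) is only a guess that email is the more selective property; swap the two if your data suggests otherwise.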
The issue is that I don't know how Neo4j builds its Cypher execution plans. For instance, let's say I run the following test query:
MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2
RETURN p1, p2
LIMIT 100;
Despite the LIMIT clause, Neo4j still takes a considerable amount of time to return the results, which makes me wonder whether, even for such a limited query, Neo4j produces the whole Cartesian product before applying the LIMIT.
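One way to check this directly is to prefix the query with PROFILE (or EXPLAIN, which compiles the plan without executing it); Neo4j then returns the operator tree with per-operator row counts, showing whether the expensive work happens below the Limit operator (operator names vary between Neo4j versions):

```cypher
PROFILE
MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2
RETURN p1, p2
LIMIT 100;
```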
I appreciate any help, whether it addresses this specific issue or simply explains how Neo4j builds Cypher execution plans (and thus how to optimize queries in general). Could legacy Lucene indexes be of any help here?
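For reference, a legacy (manual) Lucene index is queried through a START clause rather than MATCH; this is only a sketch, and it assumes a manual node index named `person_email` (a hypothetical name) has already been created and populated. For simple label/property lookups, schema indexes created with CREATE INDEX generally supersede this mechanism:

```cypher
// Look up a node through a hypothetical manual Lucene index 'person_email'.
START p1=node:person_email(email = 'alice@example.org')
MATCH (p2:Person {email: p1.email, name: p1.name})
WHERE p1 <> p2
RETURN p1, p2;
```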
PROFILE will give you more info about the performance; it is supposed to be the equivalent of psql's EXPLAIN ANALYZE. stackoverflow.com/questions/17760627/… – ulkas