I am currently trying to merge three datasets for analysis purposes. I am using certain common fields to establish the connections between the datasets. In order to create the connections I have tried using the following type of query:
MATCH (p1:Person),(p2:Person)
WHERE p1.email = p2.email AND p1.name = p2.name AND p1 <> p2
CREATE UNIQUE (p1)-[:IS]-(p2);
This can equivalently be written as:
MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2
CREATE UNIQUE (p1)-[:IS]-(p2);
Needless to say, this is a very slow query on a database with about 100,000 Person nodes, especially given that Neo4j does not parallelize a single query.
Now, my question is whether there is a better way to run such queries in Neo4j. I have at least eight CPU cores to dedicate to Neo4j, provided that separate threads do not block one another on locks over shared resources.
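One approach that may help (a sketch, assuming a Neo4j version with schema indexes, i.e. 2.x or later; whether the planner actually uses the index for a property-to-property join varies by version, which PROFILE will reveal) is to create an index on the property being joined and use MERGE instead of CREATE UNIQUE, so each candidate p2 can be found by index lookup rather than by scanning every Person node for every p1:

```cypher
// Index Person.email so the second node can be located by lookup
// instead of a full label scan per p1.
CREATE INDEX ON :Person(email);

// Anchor on p1, find candidate p2 nodes by email, then filter by name.
// MERGE on an undirected pattern avoids creating duplicate relationships.
MATCH (p1:Person)
MATCH (p2:Person {email: p1.email})
WHERE p2.name = p1.name AND p1 <> p2
MERGE (p1)-[:IS]-(p2);
```

Filtering on email first (rather than name) is only a guess that email is the more selective property; swap the two if your data suggests otherwise.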
The issue is that I don't know how Neo4j builds its Cypher execution plans. For instance, let's say I run the following test query:
MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2
RETURN p1, p2
LIMIT 100;
Despite the LIMIT clause, Neo4j still takes a considerable amount of time to return the results, which makes me wonder whether, even for such a limited query, Neo4j produces the whole Cartesian product before applying the LIMIT.
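One way to check this directly is to prefix the query with PROFILE (or EXPLAIN, which compiles the plan without executing it); Neo4j then returns the operator tree with per-operator row counts, showing whether the expensive work happens below the Limit operator (operator names vary between Neo4j versions):

```cypher
PROFILE
MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2
RETURN p1, p2
LIMIT 100;
```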
I appreciate any help, whether it addresses this specific issue or simply explains how Neo4j builds Cypher execution plans (and thus how to optimize queries in general). Could legacy Lucene indexes be of any help here?
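For reference, a legacy (manual) Lucene index is queried through a START clause rather than MATCH; this is only a sketch, and it assumes a manual node index named `person_email` (a hypothetical name) has already been created and populated. For simple label/property lookups, schema indexes created with CREATE INDEX generally supersede this mechanism:

```cypher
// Look up a node through a hypothetical manual Lucene index 'person_email'.
START p1=node:person_email(email = 'alice@example.org')
MATCH (p2:Person {email: p1.email, name: p1.name})
WHERE p1 <> p2
RETURN p1, p2;
```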
PROFILE will give you more info about the performance; it is supposed to be the equivalent of psql's EXPLAIN ANALYZE. stackoverflow.com/questions/17760627/… – ulkas