I am pretty new to neo4j and the cypher query language.
My node / relationship dataset basically looks like the following:
- I have about 27000 User nodes in the database
- I have about 8000 Question nodes in the database
- Question nodes can be answered by User nodes so there are basically relationships like (user)-[:ANSWERED]->(Question)
- Some of the Question nodes trigger a property for the user, so there are relationships like (user)-[:HAS_PROPERTY]->(Property)
- Furthermore some of the Question nodes require some Properties to be able to get answered. So there are relationships like (Question)-[:REQUIRES]->(Property)
Now my query is all about finding the Questions not already answered by a particular user, taking question property requirements into account with a limit of 50 questions.
After hassling around for a while I came up with the following query:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q)
WITH q, user
WHERE a IS NULL
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = 0 or rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
The above query gives me the rows as expected and is pretty fast (about 150 milliseconds), which is awesome.
What I do not understand is the following:
When I replace the second line in the query with the user variable instead of doing a label lookup the query gets very slow. Especially for users who have answered a lot or even all the questions.
So the following query is a lot slower:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
OPTIONAL MATCH (user)-[a:ANSWERED]->(q)
WITH q, user
WHERE a IS NULL
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = 0 or rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
Why is this the case, cause I really don't understand it? Actually I thought the query would be cheaper with reusing the already matched user as a basis for the second optional match.
While doing my research on cypher performance I came along a lot of articles telling me that you should try to avoid optional matches if possible. So my first query looked like the following:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
MATCH (q) WHERE NOT (q)<-[:ANSWERED]->(user)
WITH q, user
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = 0 or rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
Same problem here. The above query is a lot slower than the first one. About 20-30 times slower.
Finally I would like to ask if I am missing something and if there is even a better way to achieve my goal.
Any help would be appreciated.
Regards,
Alex
EDIT
Below are some profiling details:
Using the following query:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q)
WITH q, user
WHERE a IS NULL
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = 0 or rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
Cypher version: CYPHER 2.2, planner: COST. 26979 total db hits in 169 ms.
Using the suggested query from Michael Hunger:
MATCH (user:User {code: 'abc'})
MATCH (:ActiveQuestions)-[]->(q:Question)
WHERE NOT (user)-[:ANSWERED]->(q)
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = 0 or rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
Cypher version: CYPHER 2.2, planner: COST. 2337573 total db hits in 2622 ms.
So my current query is much faster and more efficient.
What I really don't understand and why I titled the post "Strange neo4j cypher behaviour" is the fact, that when I modify the second row of my kinda fast query from:
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q)
to:
OPTIONAL MATCH (user)-[a:ANSWERED]->(q)
which would be kinda simpler and logic for me I get the following:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
WHERE NOT (user)-[:ANSWERED]->(q)
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = 0 or rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
Cypher version: CYPHER 2.2, planner: COST. 2337573 total db hits in 2391 ms.
So I get totally the same amount of DB hits as with the priorly mentioned slow query.
Does anybody has an explanation for this?
Furthermore it doesn't make any difference when I modify the first row
from:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question)
to:
MATCH (user:User {code: 'xyz'})
MATCH (:ActiveQuestions)-[]->(q:Question)
So I have basically two questions:
Why is the query much slower when reusing the already defined User node variable (user) compared to using
(user:User {code: 'xyz'})With my second row I use the quasi equivalent of an outer join. Against all recommendations I have come over this is much faster than using
MATCH (q) WHERE NOT (q)<-[:ANSWERED]->(user)I thought the latter is doing an outer join too which seems not to be the case.EDIT
After some further profiling I came up with a bit cheaper query. See the profiling details below:
Using the following cypher query:
MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q)
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q)
WITH q, user
WHERE a IS NULL
OPTIONAL MATCH (q)-[r:REQUIRES]->(p)
WITH q, user, count(r) as rCount
OPTIONAL MATCH (q)-[r:REQUIRES]->(p)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount
WHERE rCount = hCount
RETURN q ORDER BY q.priority DESC LIMIT 50
Cypher version: CYPHER 2.2, planner: COST. 21669 total db hits in 120 ms.
So I basically got rid of the explicit node labels (:Question) and (:Property) in example, which sounds logic for me because no explicit label scanning is required any more. This saves me about 5300 DB hits.
Anything else that could be tuned on this query?