1
votes

I am pretty new to neo4j and the cypher query language.

My node / relationship dataset basically looks like the following:

  1. I have about 27000 User nodes in the database
  2. I have about 8000 Question nodes in the database
  3. Question nodes can be answered by User nodes so there are basically relationships like (user)-[:ANSWERED]->(Question)
  4. Some of the Question nodes trigger a property for the user, so there are relationships like (user)-[:HAS_PROPERTY]->(Property)
  5. Furthermore some of the Question nodes require some Properties to be able to get answered. So there are relationships like (Question)-[:REQUIRES]->(Property)

Now my query is all about finding the Questions not already answered by a particular user, taking question property requirements into account with a limit of 50 questions.

After hassling around for a while I came up with the following query:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
WITH q, user 
WHERE a IS NULL 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

The above query gives me the rows as expected and is pretty fast (about 150 milliseconds), which is awesome.

What I do not understand is the following:

When I replace the second line in the query with the user variable instead of doing a label lookup the query gets very slow. Especially for users who have answered a lot or even all the questions.

So the following query is a lot slower:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
OPTIONAL MATCH (user)-[a:ANSWERED]->(q) 
WITH q, user 
WHERE a IS NULL 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

Why is this the case, cause I really don't understand it? Actually I thought the query would be cheaper with reusing the already matched user as a basis for the second optional match.

While doing my research on cypher performance I came along a lot of articles telling me that you should try to avoid optional matches if possible. So my first query looked like the following:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
MATCH (q) WHERE NOT (q)<-[:ANSWERED]->(user) 
WITH q, user 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

Same problem here. The above query is a lot slower than the first one. About 20-30 times slower.

Finally I would like to ask if I am missing something and if there is even a better way to achieve my goal.

Any help would be appreciated.

Regards,

Alex

EDIT

Below are some profiling details:

Using the following query:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
WITH q, user 
WHERE a IS NULL 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

Cypher version: CYPHER 2.2, planner: COST. 26979 total db hits in 169 ms.

Using the suggested query from Michael Hunger:

MATCH (user:User {code: 'abc'})
MATCH (:ActiveQuestions)-[]->(q:Question) 
WHERE NOT (user)-[:ANSWERED]->(q) 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

Cypher version: CYPHER 2.2, planner: COST. 2337573 total db hits in 2622 ms.

So my current query is much faster and more efficient.

What I really don't understand and why I titled the post "Strange neo4j cypher behaviour" is the fact, that when I modify the second row of my kinda fast query from:

OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 

to:

OPTIONAL MATCH (user)-[a:ANSWERED]->(q) 

which would be kinda simpler and logic for me I get the following:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 
WHERE NOT (user)-[:ANSWERED]->(q) 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

Cypher version: CYPHER 2.2, planner: COST. 2337573 total db hits in 2391 ms.

So I get totally the same amount of DB hits as with the priorly mentioned slow query.

Does anybody has an explanation for this?

Furthermore it doesn't make any difference when I modify the first row

from:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q:Question) 

to:

MATCH (user:User {code: 'xyz'})
MATCH (:ActiveQuestions)-[]->(q:Question)

So I have basically two questions:

  1. Why is the query much slower when reusing the already defined User node variable (user) compared to using (user:User {code: 'xyz'})

  2. With my second row I use the quasi equivalent of an outer join. Against all recommendations I have come over this is much faster than using MATCH (q) WHERE NOT (q)<-[:ANSWERED]->(user) I thought the latter is doing an outer join too which seems not to be the case.

    EDIT

After some further profiling I came up with a bit cheaper query. See the profiling details below:

Using the following cypher query:

MATCH (user:User {code: 'xyz'}), (:ActiveQuestions)-[]->(q) 
OPTIONAL MATCH (:User {code: 'xyz'})-[a:ANSWERED]->(q) 
WITH q, user 
WHERE a IS NULL 
OPTIONAL MATCH (q)-[r:REQUIRES]->(p) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(p)<-[h:HAS_PROPERTY]-(user)
WITH q, rCount, count(h) as hCount 
WHERE rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50

Cypher version: CYPHER 2.2, planner: COST. 21669 total db hits in 120 ms.

So I basically got rid of the explicit node labels (:Question) and (:Property) in example, which sounds logic for me because no explicit label scanning is required any more. This saves me about 5300 DB hits.

Anything else that could be tuned on this query?

1

1 Answers

1
votes

You span up a lot of rows with that second match, which you have to collapse again, so if you change your first WITH to either with distinct q, user or an aggregation with q,user, count(*) as answers. Then you reduce your cardinality again.

Also this already spans up a lot of rows I think (:ActiveQuestions)-[]->(q:Question)

If you run your query with PROFILE you should see how much data is accessed.

In general I'd try to change your OPTIONAL MATCH into a WHERE condition and see how it goes.

Btw. you can just label the active-questions as :ActiveQuestion no need for the additional relationship. I also added a rel-type.

MATCH (user:User {code: 'xyz'})
MATCH (:ActiveQuestions)-[:IS_ACTIVE]->(q:Question) 
WHERE NOT (user)-[:ANSWERED]->(q) 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property) 
WITH q, user, count(r) as rCount 
OPTIONAL MATCH (q)-[r:REQUIRES]->(:Property)<-[h:HAS_PROPERTY]-(user) 
WITH q, rCount, count(h) as hCount 
WHERE rCount = 0 or rCount = hCount 
RETURN q ORDER BY q.priority DESC LIMIT 50