I have
- 50K Post nodes
- 40K Tag nodes
- 125K TAGGED relationships (meaning average 2,5 tags per post)
in my graph and below query causes a "Java heap space" error.
match (p1:Post)-[r1:TAGGED]->(t:Tag)<-[r2:TAGGED]-(p2:Post)
return p1.Title, count(r1), p2.Title, count(r2)
limit 10
What I expected was some repeated rows depending on number of shared tags. I was not sure how limit
would work (stop after first 10 posts or tags). But, since I have limit 10
I did not expect this query to traverse all the graph. It seems like it does.
UPDATE 1
With a few changes, Christophe Willemsen's query returns 10 rows in 15 sec.
// I need label for the otherPost because Users are also TAGGED
MATCH (post:Post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost:Post)
RETURN post.Title, count(t) as cnt, otherPost.Title
// ORDER BY cnt DESC // for now I do not need this
LIMIT 10;
I thought "ORDER BY" clause may cause traversal of all possible paths so I removed the clause but it is still 15 sec. It is also 15 sec. when I make the limit value 1 or 1000 without sorting.
What I expect from Neo4j was: "Start from any Post
node, then jump to its Tags and find otherPosts that are tagged with the same tag. When there are 10 found stop traversing and return the results." I am pretty sure it is not doing this.
To make my expectation clear, assume that graph is this small and we use Limit 3
in the cypher query.
p1 - [t1, t2, t3] // Post1 is tagged with t1, t2 and t3
p2 - [t2, t3, t4]
p3 - [t3, t4, t5]
What I expect is:
- Start form p1 (or any Post node)
- Jump to t1
- No other posts are tagged with t1
- Jump to t2
- p2 is tagged with t2 (1 of 3)
- No other posts are tagged with t2
- Jump to t3
- p2 is tagged with t3 (2 of 3)
- p3 is tagged with t3 (3 of 3)
- we reached the limit, break
But, it seems like Limit is applied after traversing all data.
So, my question is now: Did Neo4j found all the matches and returned 10 of them or did it stop searching after first 10 matches? And of course, Why?
UPDATE 2
After helpful answers I managed to decrease the scope of my question so I tried below queries.
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1;
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1000;
// 100 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1;
// 150 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1000;
So, I still do not know why but, using aggregation methods (I tried collect(t.Name)
instead of count
) breaks the expected (at least my expectations :) behaviour of limit
functionality.