3
votes

I have

  • 50K Post nodes
  • 40K Tag nodes
  • 125K TAGGED relationships (meaning average 2,5 tags per post)

in my graph and below query causes a "Java heap space" error.

match (p1:Post)-[r1:TAGGED]->(t:Tag)<-[r2:TAGGED]-(p2:Post)
return p1.Title, count(r1), p2.Title, count(r2)
limit 10

What I expected was some repeated rows depending on number of shared tags. I was not sure how limit would work (stop after first 10 posts or tags). But, since I have limit 10 I did not expect this query to traverse all the graph. It seems like it does.

UPDATE 1

With a few changes, Christophe Willemsen's query returns 10 rows in 15 sec.

// I need label for the otherPost because Users are also TAGGED
MATCH (post:Post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost:Post) 
RETURN post.Title, count(t) as cnt, otherPost.Title
// ORDER BY cnt DESC // for now I do not need this
LIMIT 10;

I thought "ORDER BY" clause may cause traversal of all possible paths so I removed the clause but it is still 15 sec. It is also 15 sec. when I make the limit value 1 or 1000 without sorting.

What I expect from Neo4j was: "Start from any Post node, then jump to its Tags and find otherPosts that are tagged with the same tag. When there are 10 found stop traversing and return the results." I am pretty sure it is not doing this.

To make my expectation clear, assume that graph is this small and we use Limit 3 in the cypher query.

p1 - [t1, t2, t3] // Post1 is tagged with t1, t2 and t3
p2 - [t2, t3, t4]
p3 - [t3, t4, t5]

What I expect is:

  • Start form p1 (or any Post node)
  • Jump to t1
  • No other posts are tagged with t1
  • Jump to t2
  • p2 is tagged with t2 (1 of 3)
  • No other posts are tagged with t2
  • Jump to t3
  • p2 is tagged with t3 (2 of 3)
  • p3 is tagged with t3 (3 of 3)
  • we reached the limit, break

But, it seems like Limit is applied after traversing all data.

So, my question is now: Did Neo4j found all the matches and returned 10 of them or did it stop searching after first 10 matches? And of course, Why?

UPDATE 2

After helpful answers I managed to decrease the scope of my question so I tried below queries.

// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1; 

// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1000; 

// 100 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1; 

// 150 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1000;

So, I still do not know why but, using aggregation methods (I tried collect(t.Name) instead of count) breaks the expected (at least my expectations :) behaviour of limit functionality.

2
could you please describe the query in natural language?Stefan Armbruster
I did not write the query for getting some meaningful data. I just tried to fetch first ten posts those are sharing at least one tag with another post. After seeing that result I was planning to sort the data according to number of shared tags.Mehmet Ataş
have you tried this on 2.2.0-RC1 ? 2.2 has a much improved query optimizer that might help here.Stefan Armbruster

2 Answers

6
votes

This query will result in a global graph lookup, at least for neo4j 2.1.7 and below.

I would first matching the nodes and then expanding the path

MATCH (post:Post)
MATCH (post)-[:TAGS]->(t)<-[:TAGS]-(otherPost)
RETURN post, count(t) as cnt, otherPost
ORDER BY cnt DESC
LIMIT 10;

And this is the execution plan, as you can see by matching first the post nodes only (so labels index) it costs you only retrieving those and following relationships

ColumnFilter
  |
  +Top
    |
    +EagerAggregation
      |
      +Filter
        |
        +SimplePatternMatcher
          |
          +NodeByLabel

+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
|             Operator |   Rows | DbHits |                                  Identifiers |                                                                                          Other |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
|         ColumnFilter |     10 |      0 |                                              |                                                              keep columns post, cnt, otherPost |
|                  Top |     10 |      0 |                                              | {  AUTOINT0}; Cached(  INTERNAL_AGGREGATEc24f01bf-69cc-4bd9-9aed-be257028194b of type Integer) |
|     EagerAggregation |   9900 |      0 |                                              |                                                                                post, otherPost |
|               Filter | 134234 |      0 |                                              |                                                                NOT(  UNNAMED30 ==   UNNAMED43) |
| SimplePatternMatcher | 134234 |      0 | t,   UNNAMED43,   UNNAMED30, post, otherPost |                                                                                                |
|          NodeByLabel |    100 |    101 |                                   post, post |                                                                                          :Post |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+

Total database accesses: 101

And here a blog post explaining why I removed labels except for the first part of the query : http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html

2
votes

What Christophe said and Try to reduce the cardinality in between:

match (p1:Post)-[r1:TAGGED]->(t:Tag)
WITH tag, count(*) as freq, collect(distinct p1.Title) as posts
MATCH (tag)<-[r2:TAGGED]-(p2:Post)
return posts, freq, p2.Title, count(r2)
limit 10