0
votes

Setup:

  • Neo4j - 1.9.3
  • ~7,000 nodes
  • ~1.8 million relationships

I have the following cypher query that I would like to improve the performance on:

START a=node(2) MATCH (a)-[:knowledge]-(x)-[:depends]-(y)-[:knowledge]-(end) RETURN COUNT(DISTINCT end);

This returns 471 (188171 ms).

Right now I'm only getting a count but later I may want to get the values (471 in this example). The problem is it takes about 3-4 minutes to run.

The graph is highly connected with many relationships. Running the following shows how many edges of type "knowledge" exist for node a(2).

START a=node(2) MATCH (a)-[:knowledge]-(x) RETURN COUNT(a);

This returns 4350 (103 ms).

To me, this doesn't seem like many edges to check. Can I split this up somehow to improve performance?

edit: As per the comments, here are the results from running the query with profile:

profile START a=node(2) MATCH (a)-[:knowledge]-(x)-[:depends]-(y)-[:knowledge]-(end) RETURN COUNT(DISTINCT end);
==> +---------------------+
==> | COUNT(DISTINCT end) |
==> +---------------------+
==> | 471                 |
==> +---------------------+
==> 1 row
==> 
==> ColumnFilter(symKeys=["  INTERNAL_AGGREGATEcd2aff18-1c9d-47a8-9217-588cb86bbc1a"], returnItemNames=["COUNT(DISTINCT end)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["(  INTERNAL_AGGREGATEcd2aff18-1c9d-47a8-9217-588cb86bbc1a,Distinct)"], _rows=1, _db_hits=0)
==>   TraversalMatcher(trail="(a)-[  UNNAMED7:knowledge WHERE true AND true]-(x)-[  UNNAMED8:depends WHERE true AND true]-(y)-[  UNNAMED9:knowledge WHERE true AND true]-(end)", _rows=25638262, _db_hits=25679365)
==>     ParameterPipe(_rows=1, _db_hits=0)
1
Can you add directions, or is this not supposed to be directed?Eve Freeman
4350 is not a lot, but if (x)-[:depends]-(y) and (y)-[:knowledge]-(end) match just as many you have 82312875000 paths. In a graph with many relationships to nodes, what Wes says about specifying direction could be particularly helpful in ruling out irrelevant paths. You may want to run the query in webadmin with PROFILE prepended and look at (post here) the execution plan, which would show how many matches (rows) are maintained at each step of the query. Also, try returning without DISTINCT to see how many end were actually matched.jjaderberg
I'm not really using direction in this case so I don't believe I can optimize on that. I've updated my post with the profile results. As you can see, with almost 26 million hits, that's a lot to sort through. The other alternative I have is to add an additional edge type so then there will be only one individual edge for the :knowledge portions. It would mean adding unnecessary data but if I need it to increase performance, it might be worth it.firefly2442

1 Answers

2
votes

I ended up doing the following to improve performance:

profile START a=node(2) MATCH (a)-[:knowledge]-(x) WITH DISTINCT x MATCH (x)-[:depends]-(y) WITH DISTINCT y MATCH (y)-[:knowledge]-(end) WITH DISTINCT end RETURN COUNT(end);
==> +------------+
==> | COUNT(end) |
==> +------------+
==> | 471        |
==> +------------+
==> 1 row
==> 
==> ColumnFilter(symKeys=["  INTERNAL_AGGREGATE1967576a-d357-457a-b799-adbb16b93048"], returnItemNames=["COUNT(end)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["(  INTERNAL_AGGREGATE1967576a-d357-457a-b799-adbb16b93048,Count)"], _rows=1, _db_hits=0)
==>   Distinct(_rows=471, _db_hits=0)
==>     PatternMatch(g="(end)-['  UNNAMED3']-(y)", _rows=403437, _db_hits=0)
==>       Distinct(_rows=735, _db_hits=0)
==>         PatternMatch(g="(x)-['  UNNAMED2']-(y)", _rows=1653, _db_hits=0)
==>           Distinct(_rows=177, _db_hits=0)
==>             TraversalMatcher(trail="(a)-[  UNNAMED1:knowledge WHERE true AND true]-(x)", _rows=4350, _db_hits=4351)
==>               ParameterPipe(_rows=1, _db_hits=0)

By making each step a small part in the overall, it reduces the overall complexity and only follows edges that will match.