1
votes

I have the following Cypher query in Neo4j, which fetches all the nodes in the graph and their connections as JSON; the result is then used to display a graph with the Sigma.js library.

MATCH (c1:Concept), (c2:Concept), (ctx:Context), c1-[rel:TO]->c2 
WHERE (rel.user='9d6e7140-f3c3-11e3-927f-1f5ca4210ac7' 
AND ctx.uid = rel.context) 
WITH DISTINCT c1, c2 
MATCH (ctxname:Context), c1-[relall:TO]->c2 
WHERE (relall.user='9d6e7140-f3c3-11e3-927f-1f5ca4210ac7' 
AND ctxname.uid = relall.context) 
RETURN DISTINCT 
c1.uid AS source_id, 
c1.name AS source_name, 
c2.uid AS target_id, 
c2.name AS target_name, 
relall.uid AS edge_id, 
ctxname.name AS context_name, 
relall.statement AS statement_id, 
relall.weight AS weight;

This particular query returns 89 rows of data.

The strange thing is that it runs relatively fast when the number of c1 and c2 nodes and rel relationships is small. However, as the number of those nodes and the relationships between them increases, the query gets very slow, probably because Neo4j has to iterate through a lot of relationships.

Do you have any idea how I could make this query faster, given that it needs to return data in the same format and that it all has to be done in one query?

Here's the profile info:

Distinct(_rows=89, _db_hits=0)
Extract(symKeys=["c1", "c2", "ctxname", "relall"], exprKeys=["source_name", "statement_id", "edge_id", "target_id", "source_id", "target_name", "context_name", "weight"], _rows=89, _db_hits=712)

Filter(pred="(Property(relall,user(8)) == Literal(9d6e7140-f3c3-11e3-927f-1f5ca4210ac7) AND Property(ctxname,uid(1)) == Property(relall,context(7)))", _rows=89, _db_hits=267)
SimplePatternMatcher(g="(c1)-['relall']-(c2)", _rows=89, _db_hits=2166150)
NodeByLabel(identifier="ctxname", _db_hits=0, _rows=44100, label="Context", identifiers=["ctxname"], producer="NodeByLabel")

Distinct(_rows=84, _db_hits=0)
Filter(pred="Property(ctx,uid(1)) == Property(rel,context(7))", _rows=89, _db_hits=93450)
        NodeByLabel(identifier="ctx", _db_hits=0, _rows=46725, label="Context", identifiers=["ctx"], producer="NodeByLabel")
          Filter(pred="hasLabel(c2:Concept(1))", _rows=89, _db_hits=0)
            TraversalMatcher(start={"label": "Concept", "producer": "NodeByLabel", "identifiers": ["c1"]}, trail="(c1)-[rel:TO WHERE hasLabel(NodeIdentifier():Concept(1)) AND Property(RelationshipIdentifier(),user(8)) == Literal(9d6e7140-f3c3-11e3-927f-1f5ca4210ac7)]->(c2)", _rows=89, _db_hits=127572)

Thank you for any help you can provide, or at least for pointing out where the weak spot of this query is, judging from the profile info above.

2 Answers

1
votes

Your relationship is a "hyperedge" and should be a node, and you know this from past discussion :)

As you don't have an index lookup for the starting point, this query has to scan the full graph.

Enable the relationship-auto-index for the field user and start this query with a relationship-lookup.

Also, your Context is matched for every relationship it finds. I'm not sure whether you expect more than one context to match?

Also, make sure to have an index on :Context(uid).

START rel = relationship:relationship_auto_index(user='9d6e7140-f3c3-11e3-927f-1f5ca4210ac7')
WHERE type(rel) = "TO"
WITH rel, startNode(rel) as c1, endNode(rel) as c2
WHERE (c1:Concept) AND (c2:Concept)
MATCH (ctx:Context)
WHERE ctx.uid = rel.context
WITH DISTINCT c1, c2 
MATCH c1-[relall:TO]->c2 
WHERE (relall.user='9d6e7140-f3c3-11e3-927f-1f5ca4210ac7') 
MATCH (ctxname:Context)
WHERE ctxname.uid = relall.context
RETURN DISTINCT 
c1.uid AS source_id, 
c1.name AS source_name, 
c2.uid AS target_id, 
c2.name AS target_name, 
relall.uid AS edge_id, 
ctxname.name AS context_name, 
relall.statement AS statement_id, 
relall.weight AS weight;
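
The two indexes mentioned above could be set up roughly as follows. This is only a sketch that assumes Neo4j 2.x, where label indexes use the CREATE INDEX ON syntax and the legacy relationship auto-index is switched on in conf/neo4j.properties. The two property names in the comments are those legacy settings as I remember them, so check them against the documentation for your version.

```cypher
// Schema index so that the Context lookup by uid does not scan every :Context node.
CREATE INDEX ON :Context(uid);

// The relationship auto-index is not created from Cypher. In Neo4j 2.x it is
// assumed to be enabled in conf/neo4j.properties with:
//   relationship_auto_indexing=true
//   relationship_keys_indexable=user
// followed by a server restart.
```

As far as I know, auto-indexing only picks up relationships whose user property is written after it is enabled, so existing TO relationships would probably need that property set again before the START lookup finds them.
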
1
votes

First, I would recommend inlining as much information as you can. The Cypher planner works more efficiently with inline matches, because explicit relationships give it more flexibility in how to find the items.

Second, fewer MATCH clauses are better, because the planner can more easily avoid touching nodes it does not need.

After that, only indexes will help performance, namely on TO.user and Context.uid. With those indexes, this should become just a couple of quick back-end db fetches.

Here is your same query, but with the WHERE ... AND ... conditions converted to inline matches. I have added comments, but you should also remove everything above my last comment, as it is wasted computation that will just confuse the Cypher planner (as far as this example query is concerned).

MATCH (c1:Concept)-[rel:TO{user:'9d6e7140-f3c3-11e3-927f-1f5ca4210ac7'}]->(c2:Concept), (ctx:Context{uid:rel.context})
// Wait, why did we match ctx then?
WITH DISTINCT c1, c2
// We just did this... This match makes everything above it redundant...
MATCH (c1:Concept)-[relall:TO{user:'9d6e7140-f3c3-11e3-927f-1f5ca4210ac7'}]->(c2:Concept), (ctxname:Context{uid:relall.context})
RETURN DISTINCT 
c1.uid AS source_id, 
c1.name AS source_name, 
c2.uid AS target_id, 
c2.name AS target_name, 
relall.uid AS edge_id, 
ctxname.name AS context_name, 
relall.statement AS statement_id, 
relall.weight AS weight;
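
Following the advice to drop everything above the last comment, the query might reduce to something like the sketch below. It keeps the identifiers and the user id from the question. The Context lookup is written as a separate MATCH with a plain WHERE, so that relall is already bound when relall.context is read; that choice is my assumption about the safest form, and it has not been run against your data.

```cypher
// Single pass: every TO relationship for this user between two Concept nodes.
MATCH (c1:Concept)-[relall:TO {user: '9d6e7140-f3c3-11e3-927f-1f5ca4210ac7'}]->(c2:Concept)
// Resolve the context name; this step should benefit from an index on :Context(uid).
MATCH (ctxname:Context)
WHERE ctxname.uid = relall.context
RETURN DISTINCT
  c1.uid AS source_id,
  c1.name AS source_name,
  c2.uid AS target_id,
  c2.name AS target_name,
  relall.uid AS edge_id,
  ctxname.name AS context_name,
  relall.statement AS statement_id,
  relall.weight AS weight;
```

Since the first half of the original query only collects the (c1, c2) pairs that the second half matches again with the same conditions, this should return the same rows. Running it with PROFILE and comparing the _db_hits figures with those in the question would show whether the Context scan is gone.
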