why is neo4j so slow on this cypher query?

Question

I have a fairly deep tree that consists of an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call it the 1st later of the tree), and then from each of those around 35 on average to the second layer, and so on...

The initial node is a :txnEvent and all the rest are :mEvent

mEvent nodes have 4 properties, one of them called channel_name

Now, I would like to retrieve all paths that go down to the 4th layer such that those paths contain a node with channel_name==A and also channel_name==B

This query:

match (n: txnEvent)-[r:TO*1..4]->(m:mEvent) return COUNT(*);

Is telling me there are only 1,667,444 paths to consider.

However, the following query:

MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
    WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
    AND   ANY(k IN nodes(p) WHERE k.channel_name='B')       
    RETURN 
    EXTRACT (n in nodes(p) | n.channel_name),
    EXTRACT (n in nodes(p) | n.step),
    EXTRACT (n in nodes(p) | n.event_type),
    EXTRACT (n in nodes(p) | n.event_device),
    EXTRACT (r in relationships(p) | r.weight )

Takes almost 1 minute to execute (neo4j's UI on port 7474)

For completness, neo4j is telling me:

"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."

So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?

UDATE: This query, that doesn't stream data back:

MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
        WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
        AND   ANY(k IN nodes(p) WHERE k.channel_name='B')       
        RETURN 
        COUNT(*)

Takes 35s, so even though it's faster, presumably because no data is returned, I feel it's still quite slow.

UPDATE 2:

Ideally this data should go into a jupyter notebook with a python kernel.

Can you take a screenshot of the PROFILE plan of the query, after expanding all elements of the plan? — InverseFalcon

InverseFalcon InverseFalcon · Accepted Answer · 2018-01-20T20:27:38

Thanks for the PROFILE plan.

Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.

You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.

As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly so the first result was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered with no ability to prune paths during expansion means it took a very long time for the query to complete.

why is neo4j so slow on this cypher query?

1 Answers