This question is similar to these two: 16283441, 15456345.
UPDATE: here's a database dump.
In a db of 190K nodes and 727K relationships (and 128MB of database disk usage), I'd like to run the following query:
START start_node=node(<id>)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) as num_partner_partner_links;
In this db 90% of the nodes have 0 relationships, and the remaining 10% have from 1 up to 670, so the biggest network this query can return cannot possibly have more than 220K links (670*670)/2).
On nodes with less than 10K partner_partner_links the query takes 2-4 seconds, when wormed up. For more connected nodes (20-45K links) it takes about 40-50sec (don't know how much it'd take for the most connected ones).
Specifying relationship direction helps a bit but not much (but then the query doesn't return what I need it to return).
Profiling the query on one of the biggest nodes says:
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6"], returnItemNames=["num_partner_links"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6,Distinct)"], _rows=1, _db_hits=0)
==> PatternMatch(g="(partner)-['r']-(start_node)", _rows=97746, _db_hits=34370048)
==> TraversalMatcher(trail="(start_node)-[ UNNAMED3:COOCCURS_WITH WHERE true AND true]-(another_partner)-[s:COOCCURS_WITH WHERE true AND true]-(partner)", _rows=116341, _db_hits=117176)
==> ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$
I don't see why would this be so slow, most of the stuff should be in the RAM anyway. Is it possible to have this in under 100ms or neo4j is not up to that? I could put up the whole db somewhere if that would help..
The biggest puzzle is that the same query runs slower when rewritten to use with different node symbols :)
START n=node(36)
MATCH (n)-[r:COOCCURS_WITH]-(m),
(m)-[s:COOCCURS_WITH]-(p)-[:COOCCURS_WITH]-(n)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
The former always runs in +4.2 seconds, and latter in under 3.8, no matter how many times I run one and another (interleaved)!?
SW/HW details: (advanced) Neo4j v1.9.RC2, JDK 1.7.0.10, a macbook pro with an SSD disk, 8GBRAM, 2 core i7, with the following neo4j config:
neostore.nodestore.db.mapped_memory=550M
neostore.relationshipstore.db.mapped_memory=540M
neostore.propertystore.db.mapped_memory=690M
neostore.propertystore.db.strings.mapped_memory=430M
neostore.propertystore.db.arrays.mapped_memory=230M
neostore.propertystore.db.index.keys.mapped_memory=150M
neostore.propertystore.db.index.mapped_memory=140M
wrapper.java.initmemory=4092
wrapper.java.maxmemory=4092