1
votes

This question is similar to these two: 16283441, 15456345.

UPDATE: here's a database dump.

In a db of 190K nodes and 727K relationships (and 128MB of database disk usage), I'd like to run the following query:

START start_node=node(<id>) 
MATCH (start_node)-[r:COOCCURS_WITH]-(partner), 
      (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)  
RETURN COUNT(DISTINCT s) as num_partner_partner_links;

In this db 90% of the nodes have 0 relationships, and the remaining 10% have from 1 up to 670, so the biggest network this query can return cannot possibly have more than 220K links (670*670)/2).

On nodes with less than 10K partner_partner_links the query takes 2-4 seconds, when wormed up. For more connected nodes (20-45K links) it takes about 40-50sec (don't know how much it'd take for the most connected ones).

Specifying relationship direction helps a bit but not much (but then the query doesn't return what I need it to return).

Profiling the query on one of the biggest nodes says:

==> ColumnFilter(symKeys=["  INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6"], returnItemNames=["num_partner_links"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["(  INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6,Distinct)"], _rows=1, _db_hits=0)
==>   PatternMatch(g="(partner)-['r']-(start_node)", _rows=97746, _db_hits=34370048)
==>     TraversalMatcher(trail="(start_node)-[  UNNAMED3:COOCCURS_WITH WHERE true AND true]-(another_partner)-[s:COOCCURS_WITH WHERE true AND true]-(partner)", _rows=116341, _db_hits=117176)
==>       ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$ 

I don't see why would this be so slow, most of the stuff should be in the RAM anyway. Is it possible to have this in under 100ms or neo4j is not up to that? I could put up the whole db somewhere if that would help..

The biggest puzzle is that the same query runs slower when rewritten to use with different node symbols :)

START n=node(36) 
MATCH (n)-[r:COOCCURS_WITH]-(m), 
      (m)-[s:COOCCURS_WITH]-(p)-[:COOCCURS_WITH]-(n) 
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;

START start_node=node(36) 
MATCH (start_node)-[r:COOCCURS_WITH]-(partner), 
      (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)  
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;

The former always runs in +4.2 seconds, and latter in under 3.8, no matter how many times I run one and another (interleaved)!?

SW/HW details: (advanced) Neo4j v1.9.RC2, JDK 1.7.0.10, a macbook pro with an SSD disk, 8GBRAM, 2 core i7, with the following neo4j config:

neostore.nodestore.db.mapped_memory=550M
neostore.relationshipstore.db.mapped_memory=540M
neostore.propertystore.db.mapped_memory=690M
neostore.propertystore.db.strings.mapped_memory=430M
neostore.propertystore.db.arrays.mapped_memory=230M
neostore.propertystore.db.index.keys.mapped_memory=150M
neostore.propertystore.db.index.mapped_memory=140M

wrapper.java.initmemory=4092 
wrapper.java.maxmemory=4092
2

2 Answers

0
votes

Change your query to the one below. On my laptop, with significatly lower specs than yours, the execution time halves.

START start_node=node(36) 
MATCH (start_node)-[r:COOCCURS_WITH]-(partner)
WITH start_node, partner
MATCH (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;

Also, using your settings doesn't affect performance much as compared to the default settings. I'm afraid that you can't get the performance you want, but this query is a step in the right direction.

Usually the traversal API will be faster than Cypher because you explicitly control the traversal. I've mimicked the query as follows:

public class NeoTraversal {

public static void main(final String[] args) {
    final GraphDatabaseService db = new GraphDatabaseFactory()
            .newEmbeddedDatabaseBuilder("/neo4j")
            .loadPropertiesFromURL(NeoTraversal.class.getClassLoader().getResource("neo4j.properties"))
            .newGraphDatabase();
    final Set<Long> uniquePartnerRels = new HashSet<Long>();
    long startTime = System.currentTimeMillis();
    final Node start = db.getNodeById(36);
    for (final Path path : Traversal.description()
            .breadthFirst()
            .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
            .uniqueness(Uniqueness.NODE_GLOBAL)
            .evaluator(Evaluators.atDepth(1))
            .traverse(start)) {
        Node partner = start.equals(path.startNode()) ? path.endNode() : path.startNode();
        for (final Path partnerPath : Traversal.description()
                .depthFirst()
                .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                .uniqueness(Uniqueness.RELATIONSHIP_PATH)
                .evaluator(Evaluators.atDepth(2))
                .evaluator(Evaluators.includeWhereEndNodeIs(start))
                .traverse(partner)) {
            uniquePartnerRels.add(partnerPath.relationships().iterator().next().getId());
        }
    }
    System.out.println("Execution time: " + (System.currentTimeMillis() - startTime));
    System.out.println(uniquePartnerRels.size());
}

static enum Rel implements RelationshipType {
    COOCCURS_WITH
}

}

This clearly outperforms the cypher query, thus this might be a good alternative for you. Optimization is likely still possible.

0
votes

Seems like for anything but depth/breadth first traversal, neo4j is not that "blazing fast". I've solved the problem by precomputing all the networks and storing them into MongoDB. A node document describing a network looks like this:

{
    node_id : long, 
    partners : long[],
    partner_partner_links : long[]
}

Partners and partner_partner_links are ids of documents describing egdes. Fetching the whole network takes 2 queries: one for this document, and another for edge properties (which also holds node properties):

db.edge.find({"_id" : {"$in" : network.partner_partner_links}});