2
votes

In a graph with 165k nodes and 266k relationships I'd like to run the following Cypher query:

START n=node:NodeIds('id:firstId'), t=node:NodeIds('id:secondId')   
MATCH (n)-[:RELATIONSHIP_TYPE*1..3]-(t)   
RETURN count(*)

where firstId and secondId is a valid entry for the NodeIds Lucene index.

The query takes about 4 seconds to execute from the Neo4j console and I'd like to understand why is it so slow and how it could be made faster.

The index lookup from this takes about 40ms (i.e. a query just returning the two nodes takes that much) so that can't be the issue.

I run Neo4j on a Windows 8 machine with the default settings by starting from Neo4j.bat. I don't think hardware can be an issue as the query only causes a short 10% CPU spike and a barely visible spike in disk usage.

BTW the first node has a degree of 40, the second 2 and the result is 1.

Any help would be appreciated.

Edit 1, memory config:

I was running Neo4j with OOTB config by starting from Neo4j.bat with the following defaults regarding memory (if I'm not mistaken and those are the only memory-related configs):

wrapper.java.initmemory=16
wrapper.java.maxmemory=64

neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M

Shooting one into the dark I raised these values to the following:

wrapper.java.initmemory=128
wrapper.java.maxmemory=1024

neostore.nodestore.db.mapped_memory=225M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=290M
neostore.propertystore.db.strings.mapped_memory=330M
neostore.propertystore.db.arrays.mapped_memory=330M

This indeed increased Neo4j memory usage (I mean the memory usage of the java.exe instance running Neo4j) without a good increase in performance (the query takes roughly the same time, with probably a 2-300ms increase occasionally). There are GBs of RAM free so there's no hardware constraint.

Edit 2, profiler data: Running the profiler for the query in question yields the following results:

neo4j-sh (0)$ profile START n=node:NodeIds('id:4000'), t=node:NodeIds('id:64599') MATCH path = (n)-[:ASSOCIATIVY_CONNECTION*1..3]-(t) RETURN count(*);
==> +----------+
==> | count(*) |
==> +----------+
==> | 1        |
==> +----------+
==> 1 row
==> 0 ms
==> 
==> ColumnFilter(symKeys=["  INTERNAL_AGGREGATE-939275295"], returnItemNames=["count(*)"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["(  INTERNAL_AGGREGATE-939275295,CountStar)"], _rows=1, _db_hits=0)
==>   ExtractPath(name="path", patterns=["  UNNAMED3=n-[:ASSOCIATIVY_CONNECTION*1..3]-t"], _rows=1, _db_hits=0)
==>     PatternMatch(g="(n)-['  UNNAMED3']-(t)", _rows=1, _db_hits=0)
==>       Nodes(name="t", _rows=1, _db_hits=1)
==>         Nodes(name="n", _rows=1, _db_hits=1)
==>           ParameterPipe(_rows=1, _db_hits=0) 

It says 0ms but I don't know what that is supposed to mean: the result is returned after multiple seconds and the same query executed in the Data Browser's console takes about 3,5s (this is what it displays) and roughly the same amount of time fetched through the RESTful endpoint.

Edit 3, the real data set: Enough with the theory :-), this is the data set what I'm really talking about: http://associativy.com/Media/Default/Associativy/Wiki.zip It's a graph generated by using the interlinks between Wikipedia articles, created from Wikipedia dump files. It's just the beginning.

The real query I'm trying to run is actually the following one, returning the nodes building up the paths between two nodes:

START n=node:NodeIds('id:4000'), t=node:NodeIds('id:64599')   MATCH path = (n)-[:ASSOCIATIVY_CONNECTION*1..3]-(t)   RETURN nodes(path) AS Nodes

I showed the count query because I wanted the simplest query that shows the symptoms.

Edit 4:

I opened another question specifically for the path-returning query.

2
Is this the first run? Or a subsequent one? And could you share your memory config?Michael Hunger
This is a subsequent one (i.e. if I run the query multiple times I still get the same results constantly). I edited the question with the memory config (I hope this is it; I'm new to Neo4j).Piedone
Can you update your post with profile information? I have a feeling you have a supernode connected to one of your endpoints that is blowing up the query. In the shell just prepend "profile " to your query, and it will return the profiling information.Eve Freeman
Hmm, thanks for the tip. Unfortunately the console just says that an unknown error occurred when prepending "profile". I've read that this is supposed to work with M05 but even the live Neo4j Console (console.neo4j.org) returns "Error: expected START or CREATE" when trying to profile the pre-filled query.Piedone
OK, managed to run it with profiling :-). It wasn't clear to me that it has to be done from the Console but not from under Data Browser.Piedone

2 Answers

1
votes

I agree with Wes, this should return in an instant.

You upping of the config makes sense, this is in 2 different config files, right?

As you are running on windows MMIO is inside the java heap, so I would up this to:

wrapper.java.initmemory=4096 wrapper.java.maxmemory=4096

How long is the returned path? Would it make sense in your domain to specify a direction?

Can you please run the following (adapt it to the returned path length)

START n=node:NodeIds('id:4000'), 
      t=node:NodeIds('id:64599') 
MATCH path = (n)-[:ASSOCIATIVY_CONNECTION]-(a)
             (a)-[:ASSOCIATIVY_CONNECTION]-(b)-[:ASSOCIATIVY_CONNECTION]-(t) 
RETURN count(*), count(distinct a), count(a), count(distinct b), count(b);
0
votes

Are you running the 1.9 milestone release? The bidirectional matcher in 1.9 will probably do much better than 1.8.x.