2
votes

I have a Neo4j database with 7340 nodes. Each node has a label (neoplasm) and two properties (conceptID and fullySpecifiedName). Autoindexing is enabled on both properties, and I have also created schema indexes on neoplasm:conceptID and neoplasm:fullySpecifiedName. The nodes are concepts in a terminology tree. There is a single root node, and the others descend from it, often via several paths, to a depth of up to 13 levels. From a SQL Server implementation, the hierarchy structure is as follows...

Depth Relationship Count
0     1
1     37
2     360
3     1598
4     3825
5     6406
6     7967
7     7047
8     4687
9     2271
10    825
11    258
12    77
13    3

I am adding the relationships using a C# program and neo4jclient, which constructs and executes Cypher queries like this one...

MATCH (child:neoplasm), (parent:neoplasm)
WHERE child.conceptID = "448257000" AND parent.conceptID = "372095001"
CREATE (child)-[:ISA]->(parent)

Adding the relationships up to level 3 was very fast, and level 4 itself was not bad, but at level 5 things started getting very slow, an average of over 9 seconds per relationship.

The example query above was executed through the http://localhost:7474/browser/ interface and took 12917ms, so the poor execution times are not a feature of the C# code nor the neo4jclient API.

I thought graph databases were supposed to be blindingly fast and that the performance was independent of size.

So far I have added just 9033 out of 35362 relationships. Even if the speed does not degrade further as the number of relationships increases, it will take over three days to add the remainder!
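As a rough check on these figures, the sketch below sums the per-depth counts from the table and estimates the remaining load time. The 9-second rate is the level-5 average quoted above and the 12.917-second rate is the single browser timing; that the rate stays constant for the rest of the load is an assumption, not something measured. The helper name days_to_load is made up for illustration.

```python
# Per-depth relationship counts copied from the table above (depth 0..13).
counts = [1, 37, 360, 1598, 3825, 6406, 7967, 7047, 4687, 2271, 825, 258, 77, 3]

total = sum(counts)         # all relationships in the hierarchy
added = 9033                # relationships loaded so far
remaining = total - added   # relationships still to load


def days_to_load(n_relationships, seconds_each):
    """Estimated days of wall-clock time, ASSUMING a constant cost per relationship."""
    return n_relationships * seconds_each / 86400  # 86400 seconds per day


print("total:", total, "remaining:", remaining)
print("days at 9 s each:", round(days_to_load(remaining, 9.0), 2))
print("days at 12.917 s each:", round(days_to_load(remaining, 12.917), 2))
```

At exactly 9 seconds each the estimate comes out a little under three days, and it passes three days once the average reaches roughly 9.8 seconds per relationship.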

Can anyone suggest why this performance is so bad? Or is write performance of this nature normal, and it is just read performance that is so good? A sample Cypher query to return the parents of a level 5 node returns a list of 23 fullySpecifiedName properties in less time than I can measure with a stopwatch (well under a second).

3
Do you have an index on :neoplasm(conceptID)? Traversals are cheap, but lookups by id still require approaches like indexing. – Tatham Oddie
To verify that the index is really used, can you post the query plan printed when you run the following in the shell? PROFILE MATCH (child:neoplasm), (parent:neoplasm) WHERE child.conceptID = "448257000" AND parent.conceptID = "372095001" CREATE (child)-[:ISA]->(parent) – Stefan Armbruster

3 Answers

2
votes

When a query needs to use indexes on more than one labelled node at the same time, Cypher does not (yet) choose them automatically to make the query faster. Instead, try giving it hints to use them; see http://docs.neo4j.org/chunked/milestone/query-using.html#using-query-using-multiple-index-hints

PROFILE
MATCH (child:neoplasm), (parent:neoplasm)
USING INDEX child:neoplasm(conceptID)
USING INDEX parent:neoplasm(conceptID)
WHERE child.conceptID = "448257000" AND parent.conceptID = "372095001"
CREATE (child)-[:ISA]->(parent)

Does that improve things? Also, please post the PROFILE output for better insight.

1
votes

You said you're using autoindexing. However, your query would use schema indexes, not autoindexes. Autoindexes index nodes based on properties and are not tied to labels. Schema indexes are a new and stunning feature of Neo4j 2.0.

So get rid of the autoindexes and, as Tatham suggested, create schema indexes using:

CREATE INDEX ON :neoplasm(conceptID)

Even with schema indexes, inserting relationships will become slower as your graph grows, since index lookups typically scale as O(log n). However, it should be much faster than the times you've observed.
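To give a feel for the log(n) point, here is a toy comparison in Python, not Neo4j code. It counts the worst-case probes needed to find one key among n sorted keys by binary search, against a plain linear scan of all n. Modelling the index as a sorted list is an assumption made purely for illustration (Neo4j's actual index structures differ in detail), and both function names are hypothetical.

```python
def binary_search_steps(n):
    """Worst-case number of probes for a binary search over n sorted keys."""
    steps = 0
    while n > 0:
        n //= 2      # each probe discards half of the remaining keys
        steps += 1
    return steps


def linear_scan_steps(n):
    """Worst-case number of comparisons when every key has to be checked."""
    return n


# 7340 is the node count from the question; the larger sizes are hypothetical.
for n in (7340, 73400, 734000):
    print(n, "keys: scan", linear_scan_steps(n), "vs binary search", binary_search_steps(n))
```

Each CREATE in the question looks up two nodes, so the lookup cost is paid twice per relationship, but in this model it grows by only a few probes for every tenfold increase in nodes.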

0
votes

I appear to have found the answer. I restarted the Neo4j database (Neo4j 2.0.0-M06) and got the usual message, "Neo4j will be ready in a few seconds". Over half an hour later the status turned green. During that time I was monitoring the process, and it appeared to be rebuilding the Lucene indexes.

I have since tried loading more relationships, and they are now being added at an acceptable rate (~100 ms per relationship).

Thanks for the comments.