I'm building a db for syntactically annotated sentences written in natural language. The database structure is pretty specific so I was wondering if I could use neo4j for that task.
The database consists of many isolated graphs, one per sentence. Each graph is a chain of nodes, something like NODE1-[:NEXT]->NODE2-[:NEXT]->NODE3, where each node is a word with properties. Most queries to the db look like START x=node:nodes(lemma="buy") MATCH x-[:NEXT]->y-[:NEXT]->z RETURN x,y,z, so the aim is basically just to extract ngrams. I use a simple index based on word lemmas.
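To make the intent of the query concrete, here is a minimal sketch in plain Java of what the trigram match computes on a single sentence chain. It does not use Neo4j at all; the class name NgramSketch, the method trigramsFrom, and the sample sentence are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NgramSketch {
    // Returns every trigram whose first word equals the given lemma.
    // A start word with fewer than two successors yields nothing,
    // just as the MATCH pattern needs two :NEXT hops to succeed.
    static List<List<String>> trigramsFrom(List<String> sentence, String lemma) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + 2 < sentence.size(); i++) {
            if (sentence.get(i).equals(lemma)) {
                result.add(new ArrayList<>(sentence.subList(i, i + 3)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sentence, already reduced to lemmas.
        List<String> sentence = Arrays.asList("I", "buy", "a", "book", "and", "buy", "tea");
        System.out.println(trigramsFrom(sentence, "buy"));
    }
}
```

In the database, the index lookup plays the role of the equality test, and following the NEXT relationships plays the role of subList.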
There are 65 million nodes, 270 million properties and 110 million relationships in the db.
I'm using neo4j 2.0.0 M-06.
The problem is that neo4j takes too long to run such ngram queries. For example, the query above takes 140+ seconds. The time seems to depend on the number of starting nodes found in the index: when that number is large (~50k), the query becomes very slow.
I have tried querying with Cypher through webadmin, with Cypher through Java, and with the traversal framework. It looks like the problem is in retrieving items from the index, as if the items were only being collected while I iterate over them. With Cypher from Java, executing the query takes about 500 ms, but iterating over the result then takes the 140+ seconds mentioned above.
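The timing pattern above (a fast execute call followed by slow iteration) is what one would see if the result were produced lazily, with the per-row work done only when the iterator is advanced. Below is a minimal sketch of that assumption in plain Java. It is not Neo4j code: LazyResultSketch, execute, and workDone are hypothetical names, and the counter merely stands in for the index lookups and traversals.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class LazyResultSketch {
    // Counts simulated units of per-row work (stand-in for lookup + traversal).
    static int workDone = 0;

    // Hypothetical stand-in for running a query: returns immediately,
    // and each row is only computed when next() is called.
    static Iterator<Integer> execute(final int rows) {
        return new Iterator<Integer>() {
            private int i = 0;

            public boolean hasNext() {
                return i < rows;
            }

            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                workDone++; // the cost is paid here, not in execute()
                return i++;
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> result = execute(50000);
        System.out.println("work after execute: " + workDone);
        while (result.hasNext()) {
            result.next();
        }
        System.out.println("work after iteration: " + workDone);
    }
}
```

If this is what happens, the 500 ms measures only query setup, and the 140+ seconds is the real cost of the lookups and traversals for all starting nodes.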
Could anyone tell me whether something like that could be happening, or what else could cause such an issue? Is there an efficient way to handle queries like this, with many starting nodes but a simple match condition?
I'd like to stick to Cypher if it's possible, because I find it elegant and expressive and it would be great if the problem were somewhere else :)
START x=node:nodes(lemma="buy") return count(*). How much memory do you run your Neo4j server with? How many rows does your query return? – Michael Hunger