
I'm building a db for syntactically annotated sentences written in natural language. The database structure is pretty specific so I was wondering if I could use neo4j for that task.

The database consists of many isolated graphs (representing sentences), each graph is a chain of nodes, something like NODE1-[:NEXT]->NODE2-[:NEXT]->NODE3, each node being some word with properties. The majority of queries to the db are like START x=node:nodes(lemma="buy") MATCH x-[:NEXT]->y-[:NEXT]->z RETURN x,y,z, so basically the aim is just to extract ngrams. I use a simple index based on word lemmas.

There are 65 million nodes, 270 million properties and 110 million relationships in the db.

I'm using neo4j 2.0.0-M06.

The problem is that neo4j takes too much time to perform such ngram queries. For example, the query above takes like 140+ seconds. It seems to depend on the number of starting nodes found in the index. If the number is large (~50k), the query lags.

I have tried querying with Cypher through webadmin, with Cypher through Java, and with the traversal framework. It looks like the problem is in retrieving items from the index, as if it somehow collects the items while I'm iterating through them. In cypher-java, executing the query takes about 500ms, but then iterating over the result takes the 140+ seconds mentioned above.

Could anyone tell me if there may be something like that, or anything else that could cause such an issue? Maybe there's an effective way to deal with such many-starting-nodes-but-simple-match-condition queries?

I'd like to stick to Cypher if it's possible, because I find it elegant and expressive and it would be great if the problem were somewhere else :)

How long does it take when you just do START x=node:nodes(lemma="buy") RETURN count(*)? How much memory do you run your Neo4j server with? How many rows does your query return? – Michael Hunger
Shouldn't those 50k entries be the same node, if you look for that single word? – Michael Hunger

1 Answer


Since you're on 2.0.0-M06, how about using labels here?

I don't know the details of your graph model, but you could try using the lemma value as a label. In this case your query would look like:

MATCH (x:buy)-[:NEXT]->(y)-[:NEXT]->(z) RETURN x, y, z

Another idea would be to have one category node for lemma=buy. All nodes referring to this have a relationship to the category node. To distinguish the category nodes from the rest you might use labels as well. In this case the index lookup will just return the category node and you'll basically use in-graph indexing:

MATCH (c:Category)<-[:HAS_CATEGORY]-(x)-[:NEXT]->(y)-[:NEXT]->(z)
WHERE c.lemma = 'buy'
RETURN x, y, z

(Here you should use a schema index: CREATE INDEX ON :Category(lemma))
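
To make the category-node idea a bit more concrete, here is a toy in-memory model in Python. It is not Neo4j code, only a sketch of the access pattern: one lookup finds the category for a lemma, and everything after that is plain traversal. The names Node, build_sentence and trigrams are made up for this illustration, and a dict of lists stands in for the Category nodes and their HAS_CATEGORY relationships.

```python
from collections import defaultdict


class Node:
    """A word node; `next` stands in for the outgoing :NEXT relationship."""

    def __init__(self, word, lemma):
        self.word = word
        self.lemma = lemma
        self.next = None


def build_sentence(tokens, categories):
    """Chain (word, lemma) pairs with NEXT links and attach each node to its category."""
    nodes = [Node(word, lemma) for word, lemma in tokens]
    for left, right in zip(nodes, nodes[1:]):
        left.next = right
    for node in nodes:
        # Stands in for (node)-[:HAS_CATEGORY]->(category {lemma: ...})
        categories[node.lemma].append(node)
    return nodes


def trigrams(categories, lemma):
    """One category lookup, then follow NEXT twice from each attached node."""
    for x in categories.get(lemma, []):
        y = x.next
        z = y.next if y is not None else None
        if z is not None:
            yield (x.word, y.word, z.word)


categories = defaultdict(list)
build_sentence([("I", "I"), ("bought", "buy"), ("a", "a"), ("book", "book")], categories)
build_sentence([("They", "they"), ("buy", "buy"), ("it", "it")], categories)

result = list(trigrams(categories, "buy"))
print(result)
```

Only the first sentence has two words after the "buy" node, so the second one yields no trigram. Whether this pattern actually beats the legacy index lookup on your data is something you would have to measure.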