We have a Neo4j graph database with around 60 million nodes and an equivalent number of relationships.
We have been facing consistent packet drops, delays in processing, and a completely hung server after about 2 hours. We have to shut down and restart the server every time this happens, and we are having trouble understanding where we went wrong with our configuration.
We are seeing the following kinds of exceptions in the console.log file -
- java.lang.IllegalStateException: s=DISPATCHED i=true a=null o.e.jetty.server.HttpConnection - HttpConnection@609c1158{FILLING}
- java.lang.IllegalStateException: s=DISPATCHED i=true a=null o.e.j.util.thread.QueuedThreadPool
- java.lang.IllegalStateException: org.eclipse.jetty.util.SharedBlockingCallback$BlockerTimeoutException
- o.e.j.util.thread.QueuedThreadPool - Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$3@59d5a975 in qtp1667455214{STARTED,14<=21<=21,i=0,q=58}
- org.eclipse.jetty.server.Response - Committed before 500 org.neo4j.server.rest.repr.OutputFormat$1@39beaadf
- o.e.jetty.servlet.ServletHandler - /db/data/cypher java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1253) ~[jetty-server-9.2.
- org.eclipse.jetty.server.HttpChannel - /db/data/cypher java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1253) ~[jetty-server-9.2.
- org.eclipse.jetty.server.HttpChannel - Could not send response error 500: java.lang.IllegalStateException: Committed o.e.jetty.server.ServerConnector - Stopped
- o.e.jetty.servlet.ServletHandler - /db/data/cypher org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
We are using Neo4j Enterprise Edition 2.2.5 server in single (non-cluster) mode on an Azure D-series machine (8-core CPU, 56 GB RAM, Ubuntu 14.04 LTS) with an attached 500 GB data disk.
Here is a snapshot of the sizes of the neostore files -
- 8.5G Oct 2 15:48 neostore.propertystore.db
- 15G Oct 2 15:48 neostore.relationshipstore.db
- 2.5G Oct 2 15:48 neostore.nodestore.db
- 6.9M Oct 2 15:48 neostore.relationshipgroupstore.db
- 3.7K Oct 2 15:07 neostore.schemastore.db
- 145 Oct 2 15:07 neostore.labeltokenstore.db
- 170 Oct 2 15:07 neostore.relationshiptypestore.db
The Neo4j configuration is as follows -
- Allocated 30GB to file buffer cache (dbms.pagecache.memory=30G)
- Allocated 20GB to JVM heap memory (wrapper.java.initmemory=20480, wrapper.java.maxmemory=20480)
- Using the default hpc (high performance) cache type.
- Forcing the RULE planner by default (dbms.cypher.planner=RULE)
- Maximum threads processing queries is 16 (twice the number of cores) - org.neo4j.server.webserver.maxthreads=16
- Transaction timeout of 60 seconds - org.neo4j.server.transaction.timeout=60
- Guard timeout when query execution time exceeds 10 seconds - org.neo4j.server.webserver.limit.executiontime=10000
The rest of the settings are left at their defaults.
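For reference, here is roughly how these settings are laid out in our config files (a sketch only; the split across conf/neo4j.properties, conf/neo4j-wrapper.conf and conf/neo4j-server.properties is assumed to follow the standard 2.2 layout):

# conf/neo4j.properties
dbms.pagecache.memory=30G
dbms.cypher.planner=RULE
# cache_type left at the enterprise default (hpc)

# conf/neo4j-wrapper.conf
wrapper.java.initmemory=20480
wrapper.java.maxmemory=20480

# conf/neo4j-server.properties
org.neo4j.server.webserver.maxthreads=16
org.neo4j.server.transaction.timeout=60
org.neo4j.server.webserver.limit.executiontime=10000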
We actually want to set up a cluster of 3 nodes, but before that we want to be sure that our basic configuration is correct. Please help us.
--------------------------------------------------------------------------
EDITED to ADD query samples
Typically our Cypher query volume is about 18K queries an hour, which averages out to roughly 5-6 queries a second, with occasional peaks of about 80 queries per second.
Our typical queries look like the ones below:
match (a:TypeA {param:{param}})-[:RELA]->(d:TypeD)
with distinct d,a skip {skip} limit 100
optional match (d)-[:RELF]->(c:TypeC)<-[:RELF]-(b:TypeB)<-[:RELB]-(a)
with distinct d,a,collect(distinct b.bid) as bids,collect(distinct c.param3) as param3Coll
optional match (d)-[:RELE]->(p:TypeE)<-[:RELE]-(b1:TypeB)<-[:RELB]-(a)
with distinct d as distD,bids+collect(distinct b1.bid) as tbids,param3Coll,collect(distinct p.param4) as param4Coll
optional match (distD)-[:RELC]->(f:TypeF)
return id(distD),distD.param5,exists((distD)<-[:RELG]-()) as param6, tbids,param3Coll,param4Coll,collect(distinct id(f)) as fids

match (a:TypeA {param:{param}})-[:RELB]->(b) return count(distinct b)

MATCH (a:TypeA{param:{param}})-[r:RELD]->(a1)-[:RELH]->(h) where r.param1=true
with a,a1,h
match (h)-[:RELL]->(d:TypeI) where (d.param2/2)%2=1
optional match (a)-[:RELB]-(b)-[:RELM {param3:true}]->(c)
return a1.param,id(a1),collect(b.bid),c.param5

match (a:TypeA {param:{param}})
match (a)-[:RELB]->(b)
with distinct b,a skip {skip} limit 100
match (a)-[:RELH]->(h1:TypeH)
match (b)-[:RELF|RELE]->(x)<-[:RELF|RELE]-(h2:TypeH)<-[:RELH]-(a1)
optional match (a1)<-[rd:RELD]-(a)
with distinct a1,a,h1,b,h2,rd.param1 as param2,collect(distinct x.param3) as param3s,collect(distinct x.param4) as param4s
optional match (a1)-[:RELB]->(b1) where b1.param7 in [0,1] and exists((b1)-[:RELF|RELE]->()<-[:RELF|RELE]-(h1))
with distinct a1,a,b,h2,param2,param3s,param4s,b1,case when param2 then false else case when ((a1.param5 in [2,3] or length(param3s)>0) or (a1.param5 in [1,3] or length(param4s)>0)) then case when b1.param7=0 then false else true end else false end end as param8
MERGE (a)-[r2:RELD]->(a1) on create set r2.param6=true on match set r2.param6=case when param8=true and r2.param9=false then true else false end
MERGE (b)-[r3:RELM]->(h2)
SET r2.param9=param8, r3.param9=param8

MATCH (a:TypeA {param:{param}})-[:RELI]->(g:TypeG {type:'type1'})
match (g)<-[r:RELI]-(a1:TypeA)-[:RELJ]->(j)-[:RELK]->(g)
return distinct g, collect(j.displayName), collect(r.param1), g.gid, collect(a1.param),collect(id(a1))

match (a:TypeA {param:{param}})-[r:RELD {param2:true}]->(a1:TypeA)-[:RELH]->(b:TypeE) remove r.param2 return id(a1),b.displayName, b.firstName,b.lastName

match (a:TypeA {param:{param}})-[:RELA]->(b:TypeB) return a.param1,count(distinct id(b))

MATCH (a:TypeA {param:{param}}) set a.param1=true;

match (a:TypeE)<-[r:RELE]-(b:TypeB) where a.param4 in {param4s} delete r return count(b);

MATCH (a:TypeA {param:{param}}) return id(a);
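If it helps with diagnosis, we can prefix any of these with PROFILE and share the plans produced under the forced RULE planner. A sketch of what we would run for the second query above ({param} replaced here with a hypothetical literal, since parameters cannot be bound in the 2.2 browser):

PROFILE match (a:TypeA {param:'someValue'})-[:RELB]->(b) return count(distinct b)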
Adding a few more strange things I have been noticing:
I have stopped all my web servers, so currently there are no incoming requests to Neo4j. However, I see about 40K open file handles in TCP CLOSE_WAIT state, implying that the clients closed their connections because of timeouts and Neo4j never processed and responded to those requests. I also see (from messages.log) that the Neo4j server is still processing queries, and as it does so, the number of open file handles slowly reduces. By the time I write this post there are about 27K open file handles in TCP CLOSE_WAIT state.
I also see that the queries are not processed continuously. Every once in a while there is a pause in messages.log, together with these messages about log rotation and an out-of-order sequence:
Rotating log version:5630
2015-10-04 05:10:42.712+0000 INFO [o.n.k.LogRotationImpl]: Log Rotation [5630]: Awaiting all transactions closed...
2015-10-04 05:10:42.712+0000 INFO [o.n.k.i.s.StoreFactory]: Waiting for all transactions to close...
committed: out-of-order-sequence:95494483 [95494476]
committing: 95494483
closed: out-of-order-sequence:95494480 [95494246]
2015-10-04 05:10:43.293+0000 INFO [o.n.k.LogRotationImpl]: Log Rotation [5630]: Starting store flush...
2015-10-04 05:10:44.941+0000 INFO [o.n.k.i.s.StoreFactory]: About to rotate counts store at transaction 95494483 to [/datadrive/graph.db/neostore.counts.db.b], from [/datadrive/graph.db/neostore.counts.db.a].
2015-10-04 05:10:44.944+0000 INFO [o.n.k.i.s.StoreFactory]: Successfully rotated counts store at transaction 95494483 to [/datadrive/graph.db/neostore.counts.db.b], from [/datadrive/graph.db/neostore.counts.db.a].
I also see these messages once in a while:
2015-10-04 04:59:59.731+0000 DEBUG [o.n.k.EmbeddedGraphDatabase]: NodeCache array:66890956 purge:93 size:1.3485746GiB misses:0.80978173% collisions:1.9829895% (345785) av.purge waits:13 purge waits:0 avg. purge time:110ms
or
2015-10-04 05:10:20.768+0000 DEBUG [o.n.k.EmbeddedGraphDatabase]: RelationshipCache array:66890956 purge:0 size:257.883MiB misses:10.522135% collisions:11.121769% (5442101) av.purge waits:0 purge waits:0 avg. purge time:N/A
All of this is happening while there are no incoming requests and Neo4j is processing the ~40K old pending requests mentioned above.
Since it is a dedicated server, shouldn't it be processing queries continuously without building up such a large pending queue? Am I missing something here? Please help me.