1
votes

I have installed storm-0.9.2 in a 5-node cluster. I have a simple topology with 1 spout and a varying number of bolts (4, 9, 22, 31). For each configuration I have configured (#bolts + 1) workers. Thus for 4 bolts I have 5 workers, for 22 bolts 23 workers, etc.

I have observed failed worker processes in the worker log files, with a corresponding EndOfStreamException in the zookeeper.out log file. On a clean test run, the tuples processed by the bolts are evenly distributed across the workers. On a non-clean test run, the failed workers attempt to reconnect, but since the number of tuples is finite, there are no more tuples left to process.

What are the possible causes for a worker process to die?

Excerpt from zookeeper.out log file:

2014-10-27 17:40:33,198 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x1495431347c001e, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:744)
2014-10-27 17:40:33,201 [myid:] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.0.1:45693 which had sessionid 0x1495431347c001e

Cluster Environment:

  • Storm 0.9.2
  • Zookeeper 3.4.6
  • Ubuntu 13.10
1
What is your JDK version? - Chiron
java version "1.7.0_55" OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.13.10.1) OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode) - Dennis Ignacio
Have you discovered why? - freedev

1 Answer

0
votes

To me, it looks like a problem with your Zookeeper setup. A couple of ideas:

  • Your Zookeeper session timeout is configured too low.
  • Your Zookeeper ensemble doesn't have enough servers to handle your workload.

To diagnose, start by increasing the Zookeeper session timeout. If that does not help, try expanding your Zookeeper cluster.
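
One thing to keep in mind when raising the timeout: the Zookeeper server clamps whatever session timeout the client requests to the range [minSessionTimeout, maxSessionTimeout], which default to 2 × tickTime and 20 × tickTime. So raising the client side alone (in Storm, as far as I know, `storm.zookeeper.session.timeout` in storm.yaml, 20000 ms by default) may have no effect unless zoo.cfg allows it. Here is a rough Python sketch of that clamping rule. The function name is made up for illustration, and the tickTime of 2000 ms is an assumption taken from the sample zoo.cfg, so check your own values.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_session_timeout_ms=None,
                               max_session_timeout_ms=None):
    """Estimate the session timeout a Zookeeper server would grant.

    Hypothetical helper, not part of Storm or Zookeeper. It mirrors the
    documented rule that the requested timeout is clamped to the range
    [minSessionTimeout, maxSessionTimeout].
    """
    # Zookeeper defaults when zoo.cfg does not set them explicitly:
    # minSessionTimeout = 2 * tickTime, maxSessionTimeout = 20 * tickTime.
    lower = (2 * tick_time_ms if min_session_timeout_ms is None
             else min_session_timeout_ms)
    upper = (20 * tick_time_ms if max_session_timeout_ms is None
             else max_session_timeout_ms)
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Assumed tickTime of 2000 ms gives an allowed range of 4000..40000 ms.
    for requested in (20000, 60000):
        granted = negotiated_session_timeout(requested)
        print(f"requested {requested} ms, granted {granted} ms")
```

With these assumed defaults, a request for 60000 ms would be cut down to 40000 ms, so you would also need to raise maxSessionTimeout (or tickTime) in zoo.cfg for a larger value to take effect.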

You can consult the Zookeeper documentation for details. Please let us know if that solves your problem.