16 votes

When running in a cluster, a worker generally dies (JVM shutdown) when something goes wrong. This can be caused by many factors, and most of the time finding out what actually caused the crash is a real challenge (maybe the biggest difficulty with storm?).

Of course storm-supervisor restarts dead workers, and liveness is quite good within a storm cluster, but a worker crash is still a mess we should avoid: it adds overhead and latency (it can take a long time before a worker is detected as dead and respawned), and it means data loss if the topology was not designed to prevent it.

Is there an easy way / tool / methodology to check when, and if possible why, a storm worker crashes? Workers are not shown in storm-ui (whereas supervisors are), and everything requires careful manual monitoring (with jstack plus JVM options, for instance).

Here are some cases that can happen:

  • timeouts, with many possible causes: slow Java garbage collection, bad network, badly sized timeout configuration. The only output we get natively from the supervisor logs is "state: timeout" or "state: disallowed", which is poor. On top of that, when a worker dies its statistics in storm-ui are reset. Because you are scared of timeouts you end up using very long ones, which does not seem like a good solution for real-time processing.
  • high back pressure with unexpected behaviour, for instance starving worker heartbeats and causing a timeout. Acking seems to be the only way to deal with back pressure, and it requires careful crafting of bolts according to your load. Not acking looks like a no-go, as it would indeed crash workers and give worse results in the end (even less data processed than an acking topology under pressure?).
  • code runtime exceptions, sometimes not shown in storm-ui, which require manual checking of the application logs (the easiest case).
  • memory leaks, which can be found with JVM heap dumps (see the configuration sketch after this list).
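
For reference, something like the following in storm.yaml is what I mean by JVM opts (a rough sketch: the flag names are HotSpot ones for Java 7/8, and the paths, heap size and timeout values are examples only, not recommendations):

    # storm.yaml sketch -- paths and sizes are examples; %ID% is replaced by the worker port
    worker.childopts: "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/storm/dumps -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/gc-%ID%.log"

    # timeout settings you may end up raising (values shown are the defaults)
    supervisor.worker.timeout.secs: 30
    nimbus.task.timeout.secs: 30

With this, each worker at least leaves GC logs and a heap dump on OutOfMemoryError behind when it dies.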
timeouts and backpressure don't sound like JVM crashes, they're application behavior. Do you need general monitoring for all VMs in your network or are you trying to diagnose a specific issue in-depth? - the8472
@the8472 yes you are right, this is storm related. A worker is a java process with its own JVM instance managed by storm. The bad part is that it almost always fails silently (nearly nothing in the worker logs). It is spawned and/or killed by another java process named "storm-supervisor", which does not log much either. I asked this question so maybe we can define here a proper method for monitoring worker crashes and making debugging easier. - zenbeni
I guess the first step would be figuring out whether the workers get shut down by the supervisor or exit uncleanly due to error conditions within the process. Different potential causes probably have to be attacked differently. - the8472
Note that the supervisor usually kill -9's the worker, without necessarily warning it that it is about to do so. I recommend getting your bolts to do more detailed logging; in particular, make sure you log all exceptions. When restarted, have them send you their last logs over the net. If nothing shows up in the logs, then it's the supervisor killing them. - tucuxi

1 Answer

1 vote

The storm supervisor logs restarts due to timeouts, so you can monitor the supervisor log. You can also monitor the performance of your bolts' execute(tuple) method.
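
A minimal sketch of what that instrumentation could look like (assuming Storm 1.x package names and slf4j; on older releases the imports live under backtype.storm.* instead, and the class name and 500 ms threshold are made up for the example):

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class InstrumentedBolt extends BaseRichBolt {

        private static final Logger LOG = LoggerFactory.getLogger(InstrumentedBolt.class);
        private static final long SLOW_TUPLE_MS = 500; // arbitrary threshold for this sketch

        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            long start = System.currentTimeMillis();
            try {
                doWork(tuple); // the bolt's real processing goes here
                collector.ack(tuple);
            } catch (Exception e) {
                // Log every exception so a dying worker leaves a trace on disk,
                // and report it so it also shows up in storm-ui.
                LOG.error("execute failed for tuple {}", tuple, e);
                collector.reportError(e);
                collector.fail(tuple);
            } finally {
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed > SLOW_TUPLE_MS) {
                    LOG.warn("slow execute: {} ms for tuple {}", elapsed, tuple);
                }
            }
        }

        private void doWork(Tuple tuple) {
            // placeholder for the actual bolt logic
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declare output fields here if the bolt emits anything
        }
    }

Consistently slow executes in these logs are a good hint that heartbeats are being starved before the supervisor reports a timeout.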

As for memory leaks: since the storm supervisor kill -9's the worker, the heap dump it leaves behind is likely to be corrupted, so I would either use tools that monitor your heap dynamically, or stop the supervisor so you can produce heap dumps of the worker via jmap. Also try monitoring the GC logs.
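
For the jmap route, the commands are roughly as follows (the pid lookup and output path are only examples):

    # find the worker's pid (each worker is its own JVM)
    jps -lm | grep daemon.worker
    # then dump its heap while it is still alive
    jmap -dump:live,format=b,file=/tmp/worker-heap.hprof <worker-pid>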

I still recommend increasing the default timeouts.