Currently I am running a Flink program on a remote cluster of 4 machines using 144 TaskSlots. After running for around 30 minutes I received the following error:

INFO org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet - Info server for jobmanager: Failed to write json updates for job b2eaff8539c8c9b696826e69fb40ca14, because org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.io.UncheckedPrintWriter.setError(UncheckedPrintWriter.java:107)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:280)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:295)
    at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.writeJsonUpdatesForJob(JobManagerInfoServlet.java:588)
    at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.doGet(JobManagerInfoServlet.java:209)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
    at org.eclipse.jetty.server.Server.handle(Server.java:352)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:905)
    at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:427)
    at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:78)
    at org.eclipse.jetty.server.HttpConnection$Output.flush(HttpConnection.java:1139)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:86)
    at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:154)
    at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:258)
    at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:107)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:271)
    ... 24 more
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
    at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:185)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:256)
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:849)
    ... 33 more

I know that java.io.IOException: Broken pipe means that the JobManager lost some kind of connection, so I guess the whole job failed and I have to restart it. However, although I assume the job is no longer running, the web interface still lists it as running. The JobManager process also still shows up when I use jps to list the running processes on the cluster. So my question is: is my job lost, and does this error just happen randomly from time to time, or did my program cause it?

EDIT: My TaskManagers still send heartbeats every few seconds and seem to be running.

1 Answer

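It's actually a problem in the JobManagerInfoServlet, which is part of Flink's web server: it could not send the latest JSON updates for the requested job to your browser because the write failed with java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method). A broken pipe on a write usually means that the other end (here, most likely the browser) had already closed the connection. Thus, only the GET request to the web server failed.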

Such a failure should not affect the execution of the currently running Flink job. Simply refreshing your browser (with Flink's web UI) should send another GET request which then hopefully completes successfully.
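
To illustrate why such an error stays local to one request, here is a minimal sketch in plain Java sockets. It is not Flink or Jetty code; the class name BrokenPipeDemo and the method writeToClosedClient are made up for this example. A "server" keeps writing to a client that has already disconnected. The write eventually fails with an IOException (the exact message, such as "Broken pipe" or "Connection reset", depends on the operating system), but the JVM keeps running and could serve the next request.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BrokenPipeDemo {

    /**
     * Accepts one connection on a loopback port, lets the client disconnect,
     * and then keeps writing to it. Returns true if a write failed with an
     * IOException, false if all writes unexpectedly succeeded.
     */
    static boolean writeToClosedClient() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
            try (Socket conn = server.accept()) {
                // The "browser" goes away before the response has been written.
                client.close();

                OutputStream out = conn.getOutputStream();
                byte[] chunk = new byte[64 * 1024];
                // The first write may still succeed; a later one fails once the
                // operating system has noticed that the peer is gone.
                for (int i = 0; i < 1000; i++) {
                    out.write(chunk);
                    out.flush();
                    Thread.sleep(5);
                }
                return false;
            } catch (IOException e) {
                // Only this one connection is affected by the failure.
                System.out.println("Write failed: " + e.getMessage());
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        boolean failed = writeToClosedClient();
        System.out.println("Write to closed client failed: " + failed);
        System.out.println("The process is still running and could handle another request.");
    }
}
```

The exception is caught at the level of the single connection, which is roughly what happens in the stack trace above: the servlet logs the failure at INFO level and the JobManager carries on.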