Some recent Spark jobs initiated from a Hadoop (HDP-3.1.0.0) client node raised
Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
errors, and now I see that the NN and RM heaps appear stuck at high utilization levels (e.g. 80-95%) despite there being no jobs pending or running.
Yet in the RM/YARN UI, there appears to be nothing pending or running:
The errors reported in the most recent Spark jobs that failed are...
[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: ResultStage 10 (csv at NativeMethodAccessorImpl.java:0) finished in 8.558 s
[2021-02-11 22:05:20,982] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: Job 7 finished: csv at NativeMethodAccessorImpl.java:0, took 8.561029 s
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Job null committed.
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Finished processing stats for job null.
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO -
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - writing to local FS staging area
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO -
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:262)
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at java.io.DataOutputStream.write(DataOutputStream.java:107)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:352)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:441)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:305)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:369)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - Caused by: java.io.IOException: No space left on device
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at java.io.FileOutputStream.writeBytes(Native Method)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at java.io.FileOutputStream.write(FileOutputStream.java:326)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:260)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - ... 29 more
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO -
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - Traceback (most recent call last):
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - File "/home/airflow/projects/hph_etl_airflow/common_prep.py", line 112, in <module>
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - assert get.returncode == 0, "ERROR: failed to copy to local dir"
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - AssertionError: ERROR: failed to copy to local dir
[2021-02-11 22:05:24,034] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Invoking stop() from shutdown hook
[2021-02-11 22:05:24,040] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO AbstractConnector: Stopped Spark@599cff94{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
[2021-02-11 22:05:24,048] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkUI: Stopped Spark web UI at http://airflowetl.ucera.local:4041
[2021-02-11 22:05:24,092] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Interrupting monitor thread
[2021-02-11 22:05:24,106] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Shutting down all executors
[2021-02-11 22:05:24,107] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - (serviceOption=None,
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - services=List(),
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - started=false)
[2021-02-11 22:05:24,115] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Stopped
[2021-02-11 22:05:24,123] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
[2021-02-11 22:05:24,154] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MemoryStore: MemoryStore cleared
[2021-02-11 22:05:24,155] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManager: BlockManager stopped
[2021-02-11 22:05:24,157] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManagerMaster: BlockManagerMaster stopped
[2021-02-11 22:05:24,162] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
[2021-02-11 22:05:24,173] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Successfully stopped SparkContext
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Shutdown hook called
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-57e1dfa3-26e8-490b-b7ca-94bce93e36d7
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506/pyspark-225760d8-f365-49fe-8333-6d0df3cb99bd
[2021-02-11 22:05:24,646] {bash_operator.py:132} INFO - Command exited with return code 1
[2021-02-11 22:05:24,663] {taskinstance.py:1088} ERROR - Bash command failed
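The copy step that fails around line 112 of common_prep.py looks roughly like the sketch below (reconstructed from memory; the exact command, paths, and variable names are assumptions, only the assert message and the "writing to local FS staging area" print appear in the logs above). It shells out to hdfs dfs -copyToLocal to pull the job output into a local staging dir on the client node, which is where the FsShell "No space left on device" stack trace above comes from:

import subprocess

# The Spark job has already written its output to HDFS at this point.
# Paths below are placeholders, not the real locations used by the script.
hdfs_output_dir = "/path/on/hdfs/to/job/output"
local_staging_dir = "/home/airflow/staging"

print("\nwriting to local FS staging area\n")

# Shell out to the HDFS CLI to copy the output down to the client node's local disk.
get = subprocess.run(["hdfs", "dfs", "-copyToLocal", hdfs_output_dir, local_staging_dir])

# Line 112: this assert fires when the FsShell copy dies with
# "No space left on device" on the client node's local filesystem.
assert get.returncode == 0, "ERROR: failed to copy to local dir"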
Note: I can't do much more debugging, since I restarted the cluster via Ambari (it's needed for certain daily tasks, so I couldn't just leave it down), and the restart brought the NN and RM heap back down to roughly 10% and 25% respectively.
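For reference, I had been reading the heap percentages off the Ambari dashboard; querying the NN and RM JMX endpoints directly should show the same heap figures. Something along these lines (host names are placeholders, and 9870/8088 are the stock Hadoop 3.x web UI ports, which I'm assuming here):

import json
import urllib.request

# JMX endpoints of the NameNode and ResourceManager web UIs.
# Hosts are placeholders; 9870 and 8088 are the Hadoop 3.x default ports.
endpoints = {
    "NameNode": "http://namenode-host:9870/jmx?qry=java.lang:type=Memory",
    "ResourceManager": "http://rm-host:8088/jmx?qry=java.lang:type=Memory",
}

for name, url in endpoints.items():
    with urllib.request.urlopen(url) as resp:
        mem = json.load(resp)["beans"][0]["HeapMemoryUsage"]
    pct = 100.0 * mem["used"] / mem["max"]
    print(f"{name}: {mem['used'] / 1024**2:.0f} MB used of {mem['max'] / 1024**2:.0f} MB max ({pct:.0f}%)")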
Does anyone know what could be happening here? Is there anywhere else that can (still) be checked for further debugging info?