When running a Spark cluster on Dataproc with only 2 non-preemptible worker nodes and ~100 preemptible nodes, I sometimes get a cluster that is not usable at all: too many connection errors, DataNode errors, and executors that are lost but still tracked for heartbeats.
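For reference, the cluster is created roughly like this (cluster name and region below are placeholders, not my exact setup):

```sh
# ~100 preemptible workers on top of 2 regular workers
gcloud dataproc clusters create my-cluster \
  --region us-central1 \
  --num-workers 2 \
  --num-preemptible-workers 100
```

On such a cluster I keep getting errors like this: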
```
18/08/08 15:40:11 WARN org.apache.hadoop.hdfs.DataStreamer: Error Recovery for BP-877400388-10.128.0.31-1533740979408:blk_1073742308_1487 in pipeline [DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK], DatanodeInfoWithStorage[10.128.0.7:9866,DS-9f1d8b17-0fee-41c7-9d31-8ad89f0df69f,DISK]]: datanode 0(DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK]) is bad.
```
along with errors reporting slow reads:

```
Slow ReadProcessor read fields for block BP-877400388-10.128.0.31-1533740979408:blk_1073742314_1494
```
From what I can see, something is clearly not functioning correctly in those clusters, yet the cluster itself reports nothing to indicate a problem.
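The only way I've found to spot the problem is to check HDFS health manually from the master node with the standard HDFS CLI (a diagnostic sketch, not a fix):

```sh
# Lists live/dead DataNodes and per-node capacity;
# this is where the unhealthy nodes show up for me.
hdfs dfsadmin -report
```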
Also, the application master is created on a preemptible node. Why is that?
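(For completeness, this is how I'm checking where the AM lands; the application ID below is a placeholder:)

```sh
# Standard YARN CLI; among other fields it prints "AM Host",
# which in my case is one of the preemptible workers.
yarn application -status application_1533740979408_0001
```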