While running my Spark jobs on google-cloud-dataproc, I notice that only the master node is being utilized; the CPU utilization of all the worker nodes is nearly zero (around 0.8 percent). I have used both the GUI and the console to run the code. Is there any specific reason that could be causing this, and how can I get full utilization of the worker nodes?
I submit the jobs in the following manner:
gcloud dataproc jobs submit spark --properties spark.executor.cores=10 --cluster cluster-663c --class ComputeMST --jars gs://kslc/ComputeMST.jar --files gs://kslc/SIFT_full.txt -- SIFT_full.txt gs://kslc/SIFT_full.txt 5.0 12
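For reference, the driver in ComputeMST sets up its context roughly as follows (a minimal sketch, not the exact code; in particular, no master is hard-coded here, so the cluster manager supplied by Dataproc/YARN is expected to apply):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ComputeMST {
    public static void main(String[] args) {
        // No setMaster(...) call; the master should come from the submit command.
        SparkConf conf = new SparkConf().setAppName("ComputeMST");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... builds the RDDs used in the loop below ...
        sc.stop();
    }
}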
// Build the MST hierarchy level by level until the top level is reached.
while (true) {
    level_counter++;
    if (level_counter > (number_of_levels - 1)) break;
    System.out.println("LEVEL = " + level_counter);
    // Re-key each eps-net for the next level of the hierarchy.
    JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_logn1 = distributed_msts_logn.mapToPair(new next_level());
    // Merge all eps-nets that now share the same key.
    JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_next_level = distributed_msts_logn1.reduceByKey(new union_eps_nets());
    // Halve the density and apply one step of the algorithm to each merged eps-net.
    den = den / 2;
    distributed_msts_logn = distributed_msts_next_level.mapValues(new unit_step_logn(den, level_counter));
}
// Collect the final eps-nets back to the driver.
JavaRDD<epsNet> epsNetsRDDlogn = distributed_msts_logn.values();
List<epsNet> epsNetslogn = epsNetsRDDlogn.collect();
Above is the code I am trying to run.
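In case it is relevant, the initial pair RDD is created roughly like this (a sketch only; build_eps_nets and the partition count of 120 are illustrative placeholders, not the exact code):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Sketch (assumed): the level-0 eps-nets are parallelized with an explicit
// number of partitions so the records can be spread across executors.
List<Tuple2<ArrayList<Integer>, epsNet>> initial = build_eps_nets(points); // hypothetical helper
JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_logn = sc.parallelizePairs(initial, 120); // 120 is illustrative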