Worker Nodes not being used in GCE

Question

While running my spark jobs on google-cloud-dataproc, I notice that only the master node is being utilized and the CPU utilization of all the worker nodes is nearly zero percent (0.8 percent or so). I have used both the GUI as well as the console to run the code. Do you know any specific reason that could be causing this and how to make the full utilization of the worker nodes?

I submit the jobs in the following manner:

gcloud dataproc jobs submit spark --properties spark.executor.cores=10 --cluster cluster-663c --class ComputeMST --jars gs://kslc/ComputeMST.jar --files gs://kslc/SIFT_full.txt -- SIFT_full.txt gs://kslc/SIFT_fu ll.txt 5.0 12

	while(true){

	level_counter++;

	if(level_counter > (number_of_levels - 1)) break;
	
	System.out.println("LEVEL = " + level_counter);
	
	JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_logn1 = distributed_msts_logn.mapToPair(new next_level());
	
	JavaPairRDD<ArrayList<Integer>, epsNet> distributed_msts_next_level = distributed_msts_logn1.reduceByKey(new union_eps_nets());

	den = den/2;
	
	distributed_msts_logn = distributed_msts_next_level.mapValues(new unit_step_logn(den, level_counter));		
	}

	JavaRDD<epsNet> epsNetsRDDlogn =  distributed_msts_logn.values();

	List<epsNet> epsNetslogn = epsNetsRDDlogn.collect();

Above is the code, I am trying to run.

Jas Bali Jas Bali · Accepted Answer · 2018-04-02T19:43:09

You are doing a collect() in your driver program. What are you trying to achieve? Doing a collect will definitely hammer your master node resources, since driver will be collecting the results here. Generally you want to ingest data into spark (using read or parallelize on spark context), do in-memory map-reduce (transformations) and then take data out of the spark world (example, writing a parquet to hdfs) to do any collect-related stuff. Also, ensure via spark UI that you have all the executors that you asked for with given cores and memory.

Worker Nodes not being used in GCE

1 Answers