I'm working on a distributed deep learning project using Apache Hadoop, Spark, and DL4J.
My main issue is that when I start my application on Spark, it reaches the RUNNING state but never gets past 10% progress, and I get this warning:
2019-08-23 20:55:49,198 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
2019-08-23 20:55:49,224 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at saveAsTextFile at BaseTrainingMaster.java:211) (first 15 tasks are for partitions Vector(0, 1))
2019-08-23 20:55:49,226 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 2 tasks
2019-08-23 20:56:04,286 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-08-23 20:56:17,526 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-08-23 20:56:23,135 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
These last 3 lines keep repeating non-stop.
I have only 1 master and 1 slave node, with Hadoop and Spark installed on both:
- Master is at 8 GB of RAM with an Intel i5-6500
- Slave is at 4 GB of RAM with an Intel i3-4400
After checking the web UI and log files for HDFS, I can see that HDFS is working with no problems. The YARN web UI and logs also show that YARN is working fine with 1 DataNode.
Here is my code; the comments mark where it gets stuck:
VoidConfiguration config = VoidConfiguration.builder()
        .unicastPort(40123)
        .networkMask("192.168.0.0/42")
        .controllerAddress("192.168.1.35")
        .build();
log.log(Level.INFO, "==========After voidconf");

// Create the TrainingMaster instance
TrainingMaster trainingMaster = new SharedTrainingMaster.Builder(config, 1)
        .batchSizePerWorker(10)
        .workersPerNode(1)
        .build();
log.log(Level.INFO, "==========after training master");

SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, trainingMaster);
log.log(Level.INFO, "==========after sparkMultilayer");

// Execute training:
log.log(Level.INFO, "==========Starting training");
for (int i = 0; i < 100; i++) {
    log.log(Level.INFO, "Epoch : " + i); // this is the last line from my code that is printed in the log
    sparkNet.fit(rddDataSetClassification); // it gets stuck here
    log.log(Level.INFO, "Epoch : " + i + " / " + i);
}
log.log(Level.INFO, "after training");

// Dataset evaluation
Evaluation eval = sparkNet.evaluate(rddDataSetClassification);
log.log(Level.INFO, eval.stats());
yarn-site.xml
<property>
    <name>yarn.acl.enable</name>
    <value>0</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.35</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3072</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>3072</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>256</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
spark-defaults.conf:
spark.master yarn
spark.driver.memory 2500m
spark.yarn.am.memory 2500m
spark.executor.memory 2000m
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop-MS-7A75:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://hadoop-MS-7A75:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080
I suspect a resource problem, so I have tried setting properties like spark.executor.cores and spark.executor.instances to 1. I have also tried moving the memory allocations in both YARN and Spark up and down (I'm not sure how it really works).
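To sanity-check my settings, I tried to work out what YARN actually requests per container. This is only my understanding, assuming the default memory overhead of max(384 MB, 10% of the heap) and that YARN rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb (256 in my config); the numbers are the values from my config above:

```java
// Rough sketch of YARN container sizing for my settings.
// Assumptions: overhead = max(384 MB, 10% of heap); requests are
// rounded up to a multiple of yarn.scheduler.minimum-allocation-mb.
public class ContainerMath {
    static int overhead(int heapMb) {
        return Math.max(384, heapMb / 10);
    }

    static int roundUp(int mb, int minAllocMb) {
        return ((mb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    }

    static int containerMb(int heapMb) {
        return roundUp(heapMb + overhead(heapMb), 256);
    }

    public static void main(String[] args) {
        int am = containerMb(2500);       // spark.yarn.am.memory
        int executor = containerMb(2000); // spark.executor.memory
        int node = 3072;                  // yarn.nodemanager.resource.memory-mb

        System.out.println("AM container:       " + am + " MB");       // 3072
        System.out.println("Executor container: " + executor + " MB"); // 2560
        System.out.println("Both fit on the node? " + (am + executor <= node)); // false
    }
}
```

If this math is right, the AM container alone would fill my entire 3072 MB NodeManager, leaving no room for any executor, which might be why no resources are ever accepted. But I'm not sure these overhead defaults apply to my Spark version, so please correct me if the calculation is wrong.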
Logs from spark.deploy.master....out
2019-08-23 20:18:33,669 INFO master.Master: I have been elected leader! New state: ALIVE
2019-08-23 20:18:40,771 INFO master.Master: Registering worker 192.168.1.37:42869 with 4 cores, 2.8 GB RAM
Logs from spark.deploy.worker....out
19/08/23 20:18:40 INFO Worker: Connecting to master hadoop-MS-7A75:7077...
19/08/23 20:18:40 INFO TransportClientFactory: Successfully created connection to hadoop-MS-7A75/192.168.1.35:7077 after 115 ms (0 ms spent in bootstraps)
19/08/23 20:18:40 INFO Worker: Successfully registered with master spark://hadoop-MS-7A75:7077