
I'm working on a distributed deep learning project using Apache Hadoop, Spark, and DL4J.

My main issue: when I start my application on Spark, it reaches the RUNNING state but never gets past 10% progress, and I keep getting this warning:

2019-08-23 20:55:49,198 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
2019-08-23 20:55:49,224 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at saveAsTextFile at BaseTrainingMaster.java:211) (first 15 tasks are for partitions Vector(0, 1))
2019-08-23 20:55:49,226 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 2 tasks
2019-08-23 20:56:04,286 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-08-23 20:56:17,526 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-08-23 20:56:23,135 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Those last warning lines keep repeating non-stop.

I currently have only 1 master and 1 slave node, both with Hadoop and Spark installed:

  • Master: 8 GB of RAM, Intel i5-6500
  • Slave: 4 GB of RAM, Intel i3-4400

After checking the web UI and log files for HDFS, I can see that HDFS is working with no problems. The YARN web UI and logs also show that YARN is working fine, with 1 active node registered.
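To double-check the same thing programmatically, here is a minimal sketch (a standalone test snippet, not part of the training job) that uses the standard YARN client API to list the registered NodeManagers and how much of their capacity is in use:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
        yarnClient.start();
        // One NodeReport per registered NodeManager: total capability vs. resources already used
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + "  capability=" + node.getCapability()
                    + "  used=" + node.getUsed());
        }
        yarnClient.stop();
    }
}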

Here is my code; the comments mark where it gets stuck:

VoidConfiguration config = VoidConfiguration.builder()
        .unicastPort(40123)
        .networkMask("192.168.0.0/16")      // was "192.168.0.0/42"; /42 is not a valid IPv4 prefix, /16 covers 192.168.x.x
        .controllerAddress("192.168.1.35")  // the master node's IP
        .build();

log.log(Level.INFO, "==========After voidconf");

// Create the TrainingMaster instance
TrainingMaster trainingMaster = new SharedTrainingMaster.Builder(config, 1)
        .batchSizePerWorker(10)  // minibatch size each worker consumes per fit call
        .workersPerNode(1)       // one training thread per physical node
        .build();

log.log(Level.INFO, "==========after training master");
SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, trainingMaster);

log.log(Level.INFO, "==========after sparkMultilayer");

// Execute training:
log.log(Level.INFO, "==========Starting training");
for (int i = 0; i < 100; i++) {
    log.log(Level.INFO, "Epoch : " + i);    // this is the last line from my code that appears in the log
    sparkNet.fit(rddDataSetClassification); // it gets stuck here
    log.log(Level.INFO, "Epoch " + i + " done");
}
log.log(Level.INFO, "after training");

// Dataset evaluation
Evaluation eval = sparkNet.evaluate(rddDataSetClassification);
log.log(Level.INFO, eval.stats());
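(sc, conf, and rddDataSetClassification are created earlier in the program; the sketch below is a simplified, illustrative version of that setup with placeholder layer sizes, not my exact code.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;

// Spark context; executor sizing pinned explicitly while experimenting
SparkConf sparkConf = new SparkConf()
        .setAppName("DL4J distributed training")
        .set("spark.executor.instances", "1")
        .set("spark.executor.cores", "1");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

// DL4J network configuration (placeholder layer sizes)
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .list()
        .layer(0, new DenseLayer.Builder().nIn(4).nOut(10).build())
        .layer(1, new OutputLayer.Builder().nIn(10).nOut(3).build())
        .build();

// rddDataSetClassification is a JavaRDD<DataSet> loaded from HDFS (loading code omitted)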

yarn-site.xml

<property>
    <name>yarn.acl.enable</name>
    <value>0</value>
</property>

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.35</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3072</value>
</property>

<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>3072</value>
</property>

<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>256</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

spark-defaults.conf:

spark.master                      yarn
spark.driver.memory               2500m
spark.yarn.am.memory              2500m
spark.executor.memory             2000m
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://hadoop-MS-7A75:9000/spark-logs
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs://hadoop-MS-7A75:9000/spark-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080

Suspecting a resource problem, I have tried setting properties such as spark.executor.cores and spark.executor.instances to 1, and I have tried moving the memory allocations in both YARN and Spark up and down (I'm not sure how the sizing really works).
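If I understand YARN container sizing correctly, the numbers above may not even fit on one node. A back-of-the-envelope calculation, assuming Spark's default memory overhead of max(384 MB, 10% of the container size) and that YARN rounds each container up to a multiple of yarn.scheduler.minimum-allocation-mb (256 MB here):

AM/driver container:  2500 MB + 384 MB overhead = 2884 MB -> rounded up to 3072 MB
Executor container:   2000 MB + 384 MB overhead = 2384 MB -> rounded up to 2560 MB
Total requested:      3072 MB + 2560 MB = 5632 MB
Cluster capacity:     1 NodeManager x 3072 MB = 3072 MB

If that math is right, the ApplicationMaster alone fills the only node and no executor can ever be scheduled, which would explain the endless "Initial job has not accepted any resources" warning.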

Logs from spark.deploy.master....out

2019-08-23 20:18:33,669 INFO master.Master: I have been elected leader! New state: ALIVE
2019-08-23 20:18:40,771 INFO master.Master: Registering worker 192.168.1.37:42869 with 4 cores, 2.8 GB RAM

Logs from spark.deploy.worker....out

19/08/23 20:18:40 INFO Worker: Connecting to master hadoop-MS-7A75:7077...
19/08/23 20:18:40 INFO TransportClientFactory: Successfully created connection to hadoop-MS-7A75/192.168.1.35:7077 after 115 ms (0 ms spent in bootstraps)
19/08/23 20:18:40 INFO Worker: Successfully registered with master spark://hadoop-MS-7A75:7077

1 Answer


Fixed the issue by adding another slave node. I don't know exactly why or how it worked, but as soon as I added the second slave, the job ran. My guess is that the single 3072 MB NodeManager had room for the ApplicationMaster container but not for an executor as well, and the second node gave YARN somewhere to place the executor.