1 vote

I want to run my code on a cluster. My code:

import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations._
import edu.stanford.nlp.pipeline._
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

object Pre2 {

  def plainTextToLemmas(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    val sentences = doc.get(classOf[SentencesAnnotation])
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation])) {
      val lemma = token.get(classOf[LemmaAnnotation])
      if (lemma.length > 0 ) {
        lemmas += lemma.toLowerCase
      }
    }
    lemmas
  }
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("pre2")

    val sc = new SparkContext(conf)
      val plainText = sc.textFile("data/in.txt")
      val lemmatized = plainText.mapPartitions(p => {
        val props = new Properties()
        props.put("annotators", "tokenize, ssplit, pos, lemma")
        val pipeline = new StanfordCoreNLP(props)
        p.map(q => plainTextToLemmas(q, pipeline))
      })
      val lemmatized1 = lemmatized.map(l => l.head + l.tail.mkString(" "))
      val lemmatized2 = lemmatized1.filter(_.nonEmpty)
      lemmatized2.coalesce(1).saveAsTextFile("data/out.txt")
  }
}

and Cluster features:

2 nodes

each node has: 60 GB RAM

each node has: 48 cores

shared disk

I installed Spark on this cluster; one node acts as both master and worker, and the other node is a worker.

When I run my code with this command in the terminal:

./bin/spark-submit --master spark://192.168.1.20:7077 --class Main --deploy-mode cluster code/Pre2.jar

it shows:

15/08/19 15:27:21 WARN RestSubmissionClient: Unable to connect to server spark://192.168.1.20:7077.
Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/08/19 15:27:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Driver successfully submitted as driver-20150819152724-0002
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150819152724-0002 is RUNNING
Driver running on 192.168.1.19:33485 (worker-20150819115013-192.168.1.19-33485)

How can I run the above code on a Spark standalone cluster?

Your message says RUNNING; it appears to be running correctly. – mattinbits

It doesn't return anything. In the UI the state is "failed". – AHAD

Does the UI give any more details on the reason for the failure? – mattinbits

No, it doesn't give any more detail. – AHAD

You're stating --class Main, but you don't seem to have a class called Main; also, you're hard-coding the master to be local. – mattinbits
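As the last comment points out, the jar hard-codes the master as "local" with setMaster, which overrides the --master flag passed to spark-submit, and the submit command names a class (Main) that does not exist in the code. A minimal sketch of the fix, keeping the object name Pre2 from the question and dropping setMaster so the cluster manager comes from the submit command:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Pre2 {

  def main(args: Array[String]): Unit = {
    // Do NOT call setMaster here: leaving it out lets spark-submit's
    // --master flag decide whether the job runs locally or on the
    // standalone cluster.
    val conf = new SparkConf().setAppName("pre2")
    val sc = new SparkContext(conf)

    // ... rest of the job as in the question ...
  }
}
```

The job would then be submitted with `--class Pre2` (the object's actual name) rather than `--class Main`.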

2 Answers

1 vote

Make sure you check the web UI on port 8080. In your example it would be 192.168.1.20:8080.

If you are running it in Spark standalone cluster mode, try it without --deploy-mode cluster, and set your nodes' memory explicitly by adding --executor-memory 60g.
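Putting those two suggestions together with the command from the question, a client-mode submission might look like this (a sketch, assuming the class name Pre2 from the question's code and the memory value suggested above):

```shell
./bin/spark-submit \
  --master spark://192.168.1.20:7077 \
  --class Pre2 \
  --executor-memory 60g \
  code/Pre2.jar
```

In client mode the driver runs in the terminal session itself, so any exception that kills the job is printed directly instead of being buried in a worker's driver log.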

0
votes

"Warning: Master endpoint spark://192.168.1.20:7077 was not a REST server" From the error, it also looks like the master rest url is different. The rest URL could be found on master_url:8080 UI