
I am trying to run a Spark Scala application on an Amazon EMR cluster using AWS Data Pipeline. The step was added as follows in the EmrActivity:

command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.demo.GettingStarted,s3://myBucket/sampleApps/HelloWorld.jar
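For reference, this step would typically sit on an EmrActivity object in the pipeline definition. A minimal sketch (the object ids `MyEmrActivity` and `MyEmrCluster` are illustrative, not from the question) might look like:

```json
{
  "objects": [
    {
      "id": "MyEmrActivity",
      "type": "EmrActivity",
      "runsOn": { "ref": "MyEmrCluster" },
      "step": "command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.demo.GettingStarted,s3://myBucket/sampleApps/HelloWorld.jar"
    },
    {
      "id": "MyEmrCluster",
      "type": "EmrCluster",
      "applications": ["Spark"]
    }
  ]
}
```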

After looking into the EMR logs, I can see the job is failing consistently with the following stack trace:

Exception in thread "main" org.apache.spark.SparkException: Application

application_1517065923932_0001 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1034)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)

What could be the possible cause of this error?

This is a sample app that prints Hello World to console.
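For context, a minimal sketch of what such a class might look like (the actual `com.demo.GettingStarted` source is not shown in the question; this is an assumption):

```scala
// Sketch of a minimal "Hello World" main class; in the real project this
// would live in package com.demo so that --class com.demo.GettingStarted
// resolves it.
object GettingStarted {
  def main(args: Array[String]): Unit = {
    // In cluster deploy mode this output lands in the YARN container logs,
    // not on the local console where spark-submit was invoked.
    println("Hello World")
  }
}
```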

The same job works when submitted directly on AWS EMR (without Data Pipeline).


2 Answers

0
votes

Could you please check the application logs in the Resource Manager? If you enable the Hue service on EMR, you can view the logs from the UI: in Hue, go to Workflows → Dashboard → Workflow and inspect the job and container logs. I suspect that Oozie may not be able to parse the spark-defaults.conf parameters.

When I configured the following properties in spark-defaults.conf, Oozie (version 4.3.0) could not parse the configuration because the values contain spaces:

spark.driver.extraJavaOptions    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.dynamicAllocation.enabled  true
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
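One possible workaround (a sketch, not verified against Oozie 4.3.0) is to avoid spark-defaults.conf for these values and pass them on the spark-submit command line instead, quoting each space-containing value as a single `--conf` argument:

```shell
spark-submit \
  --deploy-mode cluster \
  --class com.demo.GettingStarted \
  --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70" \
  --conf "spark.dynamicAllocation.enabled=true" \
  s3://myBucket/sampleApps/HelloWorld.jar
```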

0
votes

One possible cause, I think, is the jar path: you should specify it as #{input.directoryPath}/HelloWorld.jar, where input.directoryPath comes from an S3DataNode.
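A rough sketch of that wiring in the pipeline definition (object ids are illustrative, not from the question) could be:

```json
{
  "objects": [
    {
      "id": "MyInputData",
      "type": "S3DataNode",
      "directoryPath": "s3://myBucket/sampleApps/"
    },
    {
      "id": "MyEmrActivity",
      "type": "EmrActivity",
      "runsOn": { "ref": "MyEmrCluster" },
      "input": { "ref": "MyInputData" },
      "step": "command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.demo.GettingStarted,#{input.directoryPath}/HelloWorld.jar"
    }
  ]
}
```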

It would be better, though, if you could find the exact logs.

To view the logs generated by the driver/executors, go to: pipelineLogUri → EmrClusterId → {latest_run} → containers → application → container. In the last step (container), the container whose id ends in 1 holds the driver logs; the remaining containers (2, 3, 4, ...) hold the logs generated by the executor instances.