1 vote

I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from and saved to HDFS.

Everything seems to work fine when I run the application in yarn-client mode.

But when I try to run it as a yarn-cluster application, the final saveAsNewAPIHadoopFile action on my transformed, ready-to-save RDD never seems to run!

Here is a snapshot of my Spark UI, where you can see that all the other Jobs are processed:

[screenshot: Spark UI Jobs view]

And the corresponding Stages:

[screenshot: Spark UI Stages view]

Here is the last step of my application, where the saveAsNewAPIHadoopFile method is called:

JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...

try {
    // Open a Kerberos-authenticated HBase connection (custom helper class)
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");

    // Configure the job for an incremental (bulk) load into the target table
    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");
    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();

    // Write the prepared KeyValues as HFiles into the HDFS output path
    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}

I'm running the application via spark-submit:

spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar

When I look into the output directory in HDFS, it is still empty! I'm using Spark 1.6.3 on an HDP 2.5 platform.

So I have two questions here: Where does this behavior come from (maybe a memory problem)? And what is the difference between yarn-client and yarn-cluster mode (I don't understand it yet, and the documentation isn't clear to me either)? Thanks for your help!


2 Answers

1 vote

It seems that the job doesn't start. Before starting a job, Spark checks whether the available resources are sufficient, and I think they are not. So try to reduce the driver and executor memory, as well as the driver and executor cores, in your configuration. Here you can read how to calculate appropriate resource values for the executors and the driver: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
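
For instance, a more conservative submission could look like the following. The numbers are only an illustration of the idea, not tuned values for your cluster:

spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 2 --driver-memory 4g --executor-cores 2 --executor-memory 4g --num-executors 4 /home/myuser/app.jar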

Your job runs in client mode because in client mode the driver can use all the available resources on the node. In cluster mode, however, the driver's resources are limited.

Difference between cluster and client mode (see the example commands after the lists):
Client:

Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to restart it.

Cluster:

Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
Driver runs as a dedicated, standalone process inside the Worker.
The Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The Driver program can be monitored from the Master node using the --supervise flag and be restarted in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared location or in a folder for each of the workers.
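
To make the difference concrete: both modes are started with the same spark-submit call, only the deploy mode changes (a sketch, reusing the class and JAR from your question):

# client mode: the driver runs in the spark-submit JVM on the machine you launch from
spark-submit --master yarn --deploy-mode client --class sparkhbase.BulkLoadAsKeyValue3 /home/myuser/app.jar

# cluster mode: the driver runs inside the YARN ApplicationMaster on one of the cluster nodes
spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 /home/myuser/app.jar
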
0 votes

I found out that this problem is related to a Kerberos issue! When I run the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, which is also where my Kerberos server is running. Therefore, the keytab file /etc/security/keytabs/user.keytab for the used user principal is present on that machine.

When running the app in yarn-cluster mode, the driver process is started on a random one of my Hadoop nodes. Since I forgot to copy the keytab files to the other nodes after creating them, the driver process of course couldn't find the keytab file at that local path!

So, to be able to work with Spark in a Kerberized Hadoop cluster (especially in yarn-cluster mode), you have to copy the needed keytab files of the user who runs the spark-submit command to the corresponding path on all nodes of the cluster!

scp /etc/security/keytabs/user.keytab user@workernode:/etc/security/keytabs/user.keytab

Afterwards, you should be able to run kinit -kt /etc/security/keytabs/user.keytab user on each node of the cluster.
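
As a minimal sketch, assuming the worker hostnames below are placeholders for your actual nodes, the keytab can be distributed and verified in one loop:

# copy the keytab to every node and check that a ticket can be obtained there
for node in worker1 worker2 worker3; do
  scp /etc/security/keytabs/user.keytab user@$node:/etc/security/keytabs/user.keytab
  ssh user@$node "kinit -kt /etc/security/keytabs/user.keytab user"
done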