I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from and saved to HDFS.
Everything seems to work fine when I run the application in yarn-client mode. But when I try to run it as a yarn-cluster application, the process never seems to run the final saveAsNewAPIHadoopFile action on my transformed, ready-to-save RDD!
Here is a snapshot of my Spark UI, where you can see that all the other Jobs are processed:
And the corresponding Stages:
Here is the last step of my application, where the saveAsNewAPIHadoopFile method is called:
JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...

try {
    // Kerberized HBase connection; configuration points at the secure ZooKeeper quorum
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");

    // Configure the job for HFile output against the target table
    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");
    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();

    // Write the HFiles to HDFS
    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}
I'm running the application via spark-submit --master yarn --deploy-mode cluster --class sparkhbase.BulkLoadAsKeyValue3 --driver-cores 8 --driver-memory 11g --executor-cores 4 --executor-memory 9g /home/myuser/app.jar
When I look into the output directory in HDFS, it is still empty! I'm using Spark 1.6.3 on an HDP 2.5 platform.
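For completeness, this is how I check the result. The path and the application id below are placeholders, not my real values; since the driver runs inside a YARN container in cluster mode, anything printed by the catch block should end up in the aggregated container logs rather than in my terminal:

```
# list the (still empty) HFile output directory; /path/to/output is a placeholder
hdfs dfs -ls /path/to/output

# pull the driver/executor logs; <application_id> is a placeholder from the YARN RM UI
yarn logs -applicationId <application_id>
```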
So I have two questions: Where does this behavior come from (maybe a memory problem)? And what is the difference between yarn-client and yarn-cluster mode (I don't understand it yet, and the documentation isn't clear to me)? Thanks for your help!