I am new to spark and I am trying to get my facebook data from HBASE table with following schema:
I want to do a spark job on it as explained below. Following is my code to get the JavaPairRDD.
SparkConf sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryoserializer.buffer.mb", "256");
sparkConf.set("spark.kryoserializer.buffer.max", "512");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "localhost:2181");
conf.set("hbase.regionserver.port", "60010");
String tableName = "fbData";
conf.set("hbase.master", "localhost:60010");
conf.set(TableInputFormat.INPUT_TABLE, tableName);
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class,
ImmutableBytesWritable.class, Result.class);
Now using map() of RDD I am able to get the JavaRDD for posts/comments/replies using type column:
JavaRDD<Post> results = hBaseRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Post>() {
//fetching posts
return post;
}
Now I have 3 JavaRDDs for posts, comments and replies. POJO Post has fields for comments and replies. So I want to add the comments and Replies to the post using parent post Id. How can I accomplish this with Spark? Way that I thought of was to iterate through all posts, then iterate through all the comments and replies. Thanks in advance.