
I am new to Spark and I am trying to get my Facebook data from an HBase table with the following schema:

[image: FB data schema]

I want to run a Spark job on it as explained below. The following is my code to get the JavaPairRDD:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]");
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    sparkConf.set("spark.kryoserializer.buffer.mb", "256");
    sparkConf.set("spark.kryoserializer.buffer.max", "512");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);

    // Point the HBase client at the cluster and select the input table.
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "localhost:2181");
    conf.set("hbase.regionserver.port", "60010");
    String tableName = "fbData";
    conf.set("hbase.master", "localhost:60010");
    conf.set(TableInputFormat.INPUT_TABLE, tableName);

    // Read the table as an RDD of (row key, row result) pairs.
    JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class,
            ImmutableBytesWritable.class, Result.class);

Now, using map() on the RDD, I am able to get a JavaRDD for posts/comments/replies using the type column:

    JavaRDD<Post> results = hBaseRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Post>() {
        @Override
        public Post call(Tuple2<ImmutableBytesWritable, Result> tuple) {
            // fetching posts from the Result in tuple._2() ...
            return post;
        }
    });

Now I have 3 JavaRDDs for posts, comments, and replies. The Post POJO has fields for comments and replies, so I want to add the comments and replies to the posts using the parent post ID. How can I accomplish this with Spark? The only way I could think of was to iterate through all the posts, and then through all the comments and replies. Thanks in advance.


1 Answer


One way you can do this is by making your 3 RDDs into JavaPairRDDs, with the field they have in common (the parent post ID) as the key. You can then use the join method.
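A minimal sketch of that keying step, assuming your Post exposes a getId() accessor and Comment a getParentPostId() accessor (those names are hypothetical, adapt them to your POJOs):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // Key each RDD by the parent post ID so that join() can match them up.
    JavaPairRDD<String, Post> postsById = results.mapToPair(
            new PairFunction<Post, String, Post>() {
                @Override
                public Tuple2<String, Post> call(Post post) {
                    return new Tuple2<String, Post>(post.getId(), post);
                }
            });

    JavaPairRDD<String, Comment> commentsByPostId = comments.mapToPair(
            new PairFunction<Comment, String, Comment>() {
                @Override
                public Tuple2<String, Comment> call(Comment comment) {
                    return new Tuple2<String, Comment>(comment.getParentPostId(), comment);
                }
            });

You would key the replies RDD the same way.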

With the keyed pair RDDs from the sketch above (postsById and commentsByPostId), you can then just do:

    JavaPairRDD<String, Tuple2<Post, Comment>> aggregatedResults = postsById.join(commentsByPostId);

The value of each joined pair is a Tuple2 holding the matched Post and Comment; what type you use for the final combined objects is up to you.
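As a rough follow-up sketch, here is one way to fold the joined pairs back into your Post objects (Post.addComment() is hypothetical). Note that join() emits one pair per matching (post, comment) combination, so a post with N comments appears N times; collecting them into a single Post would need a further reduceByKey, or a cogroup instead of the join:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    // Attach each matched comment to its post.
    JavaRDD<Post> postsWithComments = aggregatedResults.map(
            new Function<Tuple2<String, Tuple2<Post, Comment>>, Post>() {
                @Override
                public Post call(Tuple2<String, Tuple2<Post, Comment>> joined) {
                    Post post = joined._2()._1();
                    post.addComment(joined._2()._2()); // addComment() is hypothetical
                    return post;
                }
            });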