
So my project flow is Kafka -> Spark Streaming -> HBase.

Now I want to read the data back from HBase: a second job should go over the table created by the previous job, do some aggregation, and store the result in another table with a different column format.

Kafka -> Spark Streaming (2 ms) -> HBase -> Spark Streaming (10 ms) -> HBase

Now I don't know how to read data from HBase using Spark Streaming. I found the Cloudera Labs project SparkOnHBase (http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/), but I can't figure out how to get an InputDStream from HBase for stream processing. Please provide any pointers or library links that would help me do this.


2 Answers


You can create a DStream from a queue of RDDs using StreamingContext's queueStream method:

import java.util.LinkedList;
import java.util.Queue;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import com.cloudera.spark.hbase.JavaHBaseContext;

// conf is your SparkConf, tableName the name of the table written by the first job
JavaSparkContext sc = new JavaSparkContext(conf);
org.apache.hadoop.conf.Configuration hconf = HBaseConfiguration.create();
JavaHBaseContext jhbc = new JavaHBaseContext(sc, hconf);

Scan scan1 = new Scan();
scan1.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, tableName.getBytes());

// Create an RDD of (row key, Result) pairs from the HBase table
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> rdd = jhbc.hbaseRDD(tableName, scan1,
    new Function<Tuple2<ImmutableBytesWritable, Result>, Tuple2<ImmutableBytesWritable, Result>>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Result> call(
                Tuple2<ImmutableBytesWritable, Result> tuple) throws Exception {
            return tuple;
        }
    });

// Create the streaming context and feed the RDD in through a queue
JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(1));

Queue<JavaRDD<Tuple2<ImmutableBytesWritable, Result>>> queue =
    new LinkedList<JavaRDD<Tuple2<ImmutableBytesWritable, Result>>>();
queue.add(rdd);

JavaDStream<Tuple2<ImmutableBytesWritable, Result>> dStream = ssc.queueStream(queue);
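From there you register your aggregation on the stream and start the context as with any other DStream; as a minimal placeholder, just counting the rows per batch:

// Count the rows read from HBase in each batch and print the result,
// then start the streaming job
dStream.count().print();

ssc.start();
ssc.awaitTermination();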

PS: You could just use plain Spark (without streaming) for that.
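For reference, a minimal sketch of that batch variant using Spark's newAPIHadoopRDD with HBase's TableInputFormat; the table name and app name are placeholders, and the "aggregation" here is just a row count:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Point TableInputFormat at the table written by the first job
// ("source_table" is a placeholder name)
Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableInputFormat.INPUT_TABLE, "source_table");

SparkConf sparkConf = new SparkConf().setAppName("hbase-batch-aggregation");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

// Each element is a (row key, Result) pair; aggregate it with normal RDD
// operations and write the output to the second table
JavaPairRDD<ImmutableBytesWritable, Result> hbaseRdd =
        sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

System.out.println("Rows scanned: " + hbaseRdd.count());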