Getting data from HBase using SparkStreaming with Kafka

Question

I am trying to integrate SparkStreaming with HBase. I am calling the following APIs to connect to HBase:

HConnection hbaseConnection = HConnectionManager.createConnection(conf);
hBaseTable = hbaseConnection.getTable(hbaseTable);

Since I cannot create the connection once and broadcast it, each API call to get data from HBase is very expensive. I tried using JavaHBaseContext (JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf)) from the hbase-spark library in CDH 5.5, but I cannot import the library from Maven. Has anyone been able to resolve this issue?

I am trying to use the latest APIs to connect HBase and SparkStreaming on Cloudera.

Some of the relevant JIRA items are mentioned here:

http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

I am using JavaHBaseContext hbaseContext = new JavaHBaseContext(jssc.sparkContext(), conf); and then calling the bulk get API: hbaseContext.streamBulkGet(TableName.valueOf(tableName), 2, lines, new GetFunction2(), new ResultFunction());

But this bulk API is invoked during initialization, not for each streaming message. I also tried:

hbaseContext.foreachPartition(jDStream, new VoidFunction<Tuple2<Iterator<String>, Connection>>() {
      public void call(Tuple2<Iterator<String>, Connection> t) throws Exception { ...}
});

The API exists, but it does not work for streaming messages. I also tried hbaseContext.streamMap(jdstream, new Function<Tuple2<Iterator<String>, Connection>, Iterator<String>>() but it does not work either.

Is there an example of how to get data from HBase using the Spark Streaming API?

I cannot pull the maven repository mentioned here. hbase.apache.org/hbase-spark/dependency-info.html — Alchemist

Grant Grant · Accepted Answer · 2016-03-09T16:12:05

Where do you set up the connection? If your connection code runs only on the driver, make sure the connection object is serializable.

I am using Cassandra. What I did was create a Scala object that holds the Cassandra connection object. This way, on both the driver and the workers, there is at least one executor-scoped connection object.
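
The pattern described above can be sketched in Java as follows. This is only an illustration under stated assumptions: FakeConnection, ConnectionHolder and ConnectionHolderDemo are hypothetical names, and FakeConnection stands in for a real client. With HBase you would presumably create the real connection inside the holder instead (for example via ConnectionFactory.createConnection(conf) in HBase 1.0+). The holder is a static, lazily initialized field, so each JVM (the driver and every executor) opens its own connection the first time it is needed, and nothing has to be serialized or broadcast.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive client such as an HBase Connection.
final class FakeConnection {
    // Counts how many connections were opened in this JVM.
    static final AtomicInteger OPENED = new AtomicInteger();

    FakeConnection() {
        OPENED.incrementAndGet();
    }

    // Pretend lookup; a real client would issue a Get against a table here.
    String get(String rowKey) {
        return "value-for-" + rowKey;
    }
}

// Holds one lazily created connection per JVM, i.e. per executor process.
final class ConnectionHolder {
    private static volatile FakeConnection connection;

    private ConnectionHolder() {}

    static FakeConnection get() {
        FakeConnection c = connection;
        if (c == null) {
            synchronized (ConnectionHolder.class) {
                c = connection;
                if (c == null) {
                    // Assumption: a real version would open the HBase connection here.
                    c = new FakeConnection();
                    connection = c;
                }
            }
        }
        return c;
    }
}

final class ConnectionHolderDemo {
    // Stands in for the function body passed to a foreachPartition call.
    static void processPartition(List<String> rowKeys) {
        FakeConnection conn = ConnectionHolder.get(); // reused, not re-created
        for (String key : rowKeys) {
            System.out.println(conn.get(key));
        }
    }

    public static void main(String[] args) {
        processPartition(Arrays.asList("row1", "row2"));
        processPartition(Arrays.asList("row3"));
        System.out.println("connections opened: " + FakeConnection.OPENED.get());
    }
}
```

In a streaming job, the call to ConnectionHolder.get() would go inside the function that runs on the workers for each partition of each batch, so that the lookup happens per message batch rather than once at initialization on the driver.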
