Spark: Mapping an RDD of HBase row keys to an RDD of values

Question

I have an RDD that contains HBase row keys. The RDD is relatively large to fit in memory. I need to get an RDD of values for each of the provided key. Is there a way to do something like this:

keys.map(key => table.get(new Get(key)))

So the question is how can I obtain an instance of HTable inside map task? Should I instantiate an HConnection for every partition, and then obtain HTable instance from it, or is there a better way?

Daniel Imberman Daniel Imberman · Accepted Answer · 2016-01-21T16:08:19

There are a few options you can can do but first consider the fact that spark does not allow you to create RDDs of RDDs. So really that leaves you with two options

a list of RDDs
A Key/value RDD

I would highly recommend the second one as a list of RDDs could end with you needing to perform a lot of reduces which could massively increase the number of shuffles you need to perform. With that in mind I would recommend you use a flatMap.

So here is some basic skeleton code that could get you that result

val input:RDD[String]
val completedRequests:RDD[(String, List[String]) = input.map(a => (a, table.get(new Get(a)))
val flattenedRequests:RDD[(String, String) = completedRequests.flatMap{ case(k,v) => v.map(b =>(k,b))

You can now handle the RDD as one object, reduceByKey if you have a particular piece of information you need from it, and now spark will be able to access the data with optimal parallelism.

Hope that helps!

Spark: Mapping an RDD of HBase row keys to an RDD of values

1 Answers