I have an RDD that contains HBase row keys. The RDD is too large to fit in memory. I need to get an RDD of values for each of the provided keys. Is there a way to do something like this:

keys.map(key => table.get(new Get(key)))

So the question is: how can I obtain an instance of HTable inside a map task? Should I instantiate an HConnection for every partition and then obtain an HTable instance from it, or is there a better way?

1 Answer

There are a few options here, but first consider the fact that Spark does not allow you to create RDDs of RDDs. So really that leaves you with two options:

  1. A list of RDDs
  2. A key/value RDD

I would highly recommend the second one, since a list of RDDs could leave you needing to perform a lot of reduces, which could massively increase the number of shuffles you have to perform. With that in mind, I would recommend you use a flatMap.

So here is some basic skeleton code that could get you that result:

val input: RDD[String]                                // RDD of HBase row keys
// note: Get expects a byte-array key and table.get returns a Result,
// so some conversion to (String, List[String]) is assumed here
val completedRequests: RDD[(String, List[String])] = input.map(a => (a, table.get(new Get(a))))
val flattenedRequests: RDD[(String, String)] = completedRequests.flatMap { case (k, v) => v.map(b => (k, b)) }
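
The skeleton above assumes a table handle is available inside the map function. One way to get one, along the lines the question suggests, is to open a connection once per partition with mapPartitions. Here is a minimal sketch, assuming the newer HBase client API (ConnectionFactory) and a hypothetical table name "my_table":

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def fetchValues(input: RDD[String]): RDD[(String, String)] =
  input.mapPartitions { keys =>
    // one connection per partition, created on the executor
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("my_table"))   // hypothetical table name
    val results = keys.map { key =>
      val r = table.get(new Get(Bytes.toBytes(key)))
      (key, Bytes.toString(r.value()))   // value() returns the bytes of the first cell in the Result
    }.toList                             // materialize before closing the connection
    table.close()
    conn.close()
    results.iterator
  }

The toList call is deliberate: mapPartitions works with lazy iterators, so the results are materialized before the connection is closed.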

You can now handle the RDD as one object, use reduceByKey if there is a particular piece of information you need from it, and Spark will be able to access the data with optimal parallelism.
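
For example, if the fetched cell values happened to be numeric strings, a reduceByKey step over the flattened RDD from above might look like this (a sketch under that assumption):

// hypothetical aggregation: sum the numeric cell values per row key
val totals: RDD[(String, Long)] =
  flattenedRequests
    .mapValues(_.toLong)       // assumes the cell values parse as Long
    .reduceByKey(_ + _)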

Hope that helps!