2 votes

If I read some data from an HBase (or MapR-DB) table with

JavaPairRDD<ImmutableBytesWritable, Result> usersRDD =
        sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

the resulting RDD has 1 partition, as I can see by calling usersRDD.partitions().size(). Using something like usersRDD.repartition(10) is not viable either, because Spark complains that ImmutableBytesWritable is not serializable.

Is there a way to make Spark create a partitioned RDD from HBase data?
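For reference, here is roughly how the read is set up; a minimal sketch, where the table name "users" and the ZooKeeper quorum are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Placeholder configuration: adjust the quorum and table name for your cluster.
Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set("hbase.zookeeper.quorum", "localhost");
hbaseConf.set(TableInputFormat.INPUT_TABLE, "users");

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hbase-read"));

JavaPairRDD<ImmutableBytesWritable, Result> usersRDD =
        sc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);

// Reports 1 when the table consists of a single region.
System.out.println(usersRDD.partitions().size());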


1 Answer

1 vote

The number of Spark partitions when using org.apache.hadoop.hbase.mapreduce.TableInputFormat depends on the number of regions of the HBase table - in your case that's 1, the default for a newly created table. Have a look at my answer to a similar question for more details.
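If you cannot pre-split the table into more regions, one common workaround (a sketch, not part of the original answer; the column family "cf" and qualifier "name" are placeholders) is to map the pairs to serializable types first and only then repartition:

import org.apache.hadoop.hbase.util.Bytes;
import scala.Tuple2;

// Copy the row key and the cells you need into serializable types
// (String here), then repartition the resulting RDD.
JavaPairRDD<String, String> serializableRDD = usersRDD.mapToPair(tuple -> {
    String rowKey = Bytes.toString(tuple._1().copyBytes());
    // Assumes every row has a cf:name cell; the value is null otherwise.
    String name = Bytes.toString(tuple._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")));
    return new Tuple2<>(rowKey, name);
});

JavaPairRDD<String, String> repartitioned = serializableRDD.repartition(10);

That said, if you control the table, splitting it into several regions gives you parallel reads from the start, without the extra shuffle that repartition incurs.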