
Spark reads data from HBase, for example by creating an RDD like this:

// conf is an HBase configuration with the input table name set
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

Suppose, for example, that hBaseRDD has 5 partitions. When the executors on the workers fetch their partition data to compute on, must they pull that data from the remote driver program? (This is unlike reading from HDFS, where each worker, acting as a Hadoop slave, holds HDFS file replicas locally.)
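For context, here is a minimal sketch of how such an RDD is typically set up; the table name "my_table" and the variable names are placeholders, not taken from the original post:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result

// Build an HBase configuration and point it at the input table ("my_table" is a placeholder).
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")

// One RDD partition is created per HBase region of the table.
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"Number of partitions (regions): ${hBaseRDD.partitions.length}")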


1 Answer


Spark is integrated with HBase, and the data locality principles are the same as in Hadoop MapReduce jobs: Spark will try to assign each input partition (HBase region) to a worker on the same physical machine as the region server hosting it, so the data is fetched directly by the executor rather than routed through the driver.
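To see this locality in action, you can inspect the preferred locations Spark records for each partition; these are the region server hosts reported by TableInputFormat. The sketch below reuses the hBaseRDD variable from the question and is illustrative only:

// Print the preferred hosts (HBase region servers) for each partition of the HBase RDD.
// Spark's scheduler tries to run the task for a partition on one of these hosts,
// so the region data is read locally rather than shipped through the driver.
hBaseRDD.partitions.foreach { p =>
  println(s"partition ${p.index} -> preferred hosts: ${hBaseRDD.preferredLocations(p).mkString(", ")}")
}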