I have a huge table in hbase, with potentially millions of rows.
[HBase Table Structure][1]
I am trying to access chunks of the table, using (STARTROW & ENDROW) and the sc.newAPIHadoopRDD
function.
I am trying to find a way to get the Column Qualifier names from the resulting RDD. As each row can have any number of columns and column qualifiers, I want to get the column families of each Row, by Rowkey.
In short, I want to create a Dataframe in Spark which looks somewhat like this :
ROWKEY COLUMN NAME VALUE
ROW1 ColumnFamily:ColumnQualifier1 Value="XX"
ROW1 ColumnFamily:ColumnQualifier2 Value="XX"
ROW1 ColumnFamily:ColumnQualifier3 Value="XX"
ROW1 ColumnFamily:ColumnQualifier4 Value="XX"
ROW1 ColumnFamily:ColumnQualifier5 Value="XX"
ROW1 ColumnFamily:ColumnQualifier6 Value="XX"
ROW2 ColumnFamily:ColumnQualifier1 Value="XX"
ROW2 ColumnFamily:ColumnQualifier2 Value="XX"
ROW2 ColumnFamily:ColumnQualifier3 Value="XX"
ROW2 ColumnFamily:ColumnQualifier4 Value="XX"
ROW3 ColumnFamily:ColumnQualifier1 Value="XX"
ROW4 ColumnFamily:ColumnQualifier2 Value="XX"
So, from the RDD returned by sc.newAPIHadoopRDD
, I want to know of a way to access the column names.
Once I have the column qualifier, I can get value for family : qualifier combination using the rdd.getValue(family,qualifier)
function.
val kvRDD = sc.newAPIHadoopRDD(hbaseConf,classOf[TableInputFormat],classOf[ImmutableBytesWritable], classOf[Result])
val resultRDD = kvRDD.map(tuple => tuple._2)
val keyValueRDD = resultRDD.map(result => {
var resultStrings: List[Object] = List()
var navigablemap=result.getNoVersionMap()
val vallist = navigablemap.values()
for (each <- vallist) {
resultStrings = resultStrings ::: List(each)
}
resultStrings
})
But this is returning an rdd in which each row is encrypted. Any help with the scala code is greatly appreciated, Thanks