I'm using Spark and trying to write an RDD to an HBase table.
Here is the sample code:
public static void main(String[] args) {
// ... code omitted
JavaPairRDD<ImmutableBytesWritable, Put> hBasePutsRDD = rdd
.javaRDD()
.flatMapToPair(new MyFunction());
hBasePutsRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
}
// static, so it can be instantiated from the static main method
private static class MyFunction implements
PairFlatMapFunction<Row, ImmutableBytesWritable, Put> {
public Iterable<Tuple2<ImmutableBytesWritable, Put>> call(final Row row)
throws Exception {
// One Put per input row; no timestamp is set here
Put put = new Put(getRowKey(row));
String value = row.getAs("rddFieldName");
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
"COLUMN".getBytes(Charset.forName("UTF-8")),
value.getBytes(Charset.forName("UTF-8")));
return Collections.singletonList(
new Tuple2<>(new ImmutableBytesWritable(getRowKey(row)), put));
}
}
If I manually set the timestamp like this:
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
"COLUMN".getBytes(Charset.forName("UTF-8")),
manualTimestamp,
value.getBytes(Charset.forName("UTF-8")));
everything works fine and I get as many cell versions in the HBase column "COLUMN" as there are different values in the RDD.
But if I do not, there is only one cell version.
In other words, if there are multiple Put objects with the same column family and column, different values, and the default timestamp, only one value is inserted and the others are omitted (maybe overwritten).
Could you please help me understand how this works (saveAsNewAPIHadoopDataset especially) in this case, and how I can modify the code to insert all the values without setting a timestamp manually?
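To make the observed behavior concrete, here is a minimal sketch in plain Java (no Spark or HBase involved). It rests on two assumptions, not confirmed facts about my setup: that the versions of one cell are distinguished only by their timestamp, and that Puts written without an explicit timestamp can end up with the same timestamp. The class `CellVersionSketch` and its methods are hypothetical and are not part of the HBase API.

```java
import java.util.TreeMap;

public class CellVersionSketch {
    // Hypothetical model of one cell (fixed row, column family, qualifier):
    // each stored version is keyed by its timestamp.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Writing with a timestamp that already exists replaces that version.
    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    public int versionCount() {
        return versions.size();
    }

    // The value with the highest timestamp, or null if nothing was written.
    public String latest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        String[] values = {"a", "b", "c"};

        // Case 1: every put carries its own (manual) timestamp.
        CellVersionSketch distinct = new CellVersionSketch();
        long manualTimestamp = 1000L;
        for (String value : values) {
            distinct.put(manualTimestamp++, value);
        }
        System.out.println("distinct timestamps: " + distinct.versionCount());

        // Case 2 (assumption): all puts end up with the same timestamp.
        CellVersionSketch shared = new CellVersionSketch();
        long sharedTimestamp = 1000L;
        for (String value : values) {
            shared.put(sharedTimestamp, value);
        }
        System.out.println("shared timestamp: " + shared.versionCount());
    }
}
```

In this model the first case keeps three versions and the second keeps only one (the last value written), which matches what I see in HBase with and without the manual timestamp.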