0
votes

I'm using Spark and trying to write the RDD to the HBase table.

Here the sample code:

public static void main(String[] args) {
// ... code omitted
    JavaPairRDD<ImmutableBytesWritable, Put> hBasePutsRDD = rdd
            .javaRDD()
            .flatMapToPair(new MyFunction());

    hBasePutsRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
}

private class MyFunction implements
            PairFlatMapFunction<Row, ImmutableBytesWritable, Put> {

    public Iterable<Tuple2<ImmutableBytesWritable, Put>> call(final Row row) 
            throws Exception {

        List<Tuple2<ImmutableBytesWritable, Put>> puts = new ArrayList<>();
        Put put = new Put(getRowKey(row));
        String value = row.getAs("rddFieldName");

        put.addColumn("CF".getBytes(Charset.forName("UTF-8")), 
                      "COLUMN".getBytes(Charset.forName("UTF-8")),
                      value.getBytes(Charset.forName("UTF-8")));

        return Collections.singletonList(
            new Tuple2<>(new ImmutableBytesWritable(getRowKey(row)), put));
    }
}

If I manually set the timestamp like this:

put.addColumn("CF".getBytes(Charset.forName("UTF-8")), 
              "COLUMN".getBytes(Charset.forName("UTF-8")),
              manualTimestamp,
              value.getBytes(Charset.forName("UTF-8")));

everything works fine and I have as many cell versions in HBase column "COLUMN" as there are number of different values in RDD.

But if I do not, there is only one cell version.

In another words, if there are multiple Put objects with the same column family and column, different values and default timestamp, the only one value will be inserted and another will be omitted (maybe overwritten).

Could you please help me understand how it works (saveAsNewAPIHadoopDataset especially) in this case and how can I modify the code to insert values and do not a timestamp manually.

1

1 Answers

3
votes

They are overwritten when you don't use your timestamp. Hbase needs a unique key for every value, so real key for every value is

rowkey + column family + column key + timestamp => value

When you don't use timestamp, and they are inserted as bulk, many of them get same timestamp as hbase can insert multiple rows in same millisecond. So you need a custom timestamp for every same column key values.

I did not understand why you did not want to use custom timestamp as you said it works already. If you think it will use extra space in database, hbase already use timestamp even if you don't give in Put command. So nothing changes when you use manual timestamp, please use it.