0
votes

I have a requirement to delete data from HBase: I want to delete the latest version of each cell for a given row key. My approach was to get the column names and the latest timestamp of each column for that row key, then perform the delete iteratively for each column and its timestamp.

But I'm not able to get the column names, so I can't do it.

Please share any thoughts or working code.


2 Answers

0
votes

Here is a custom filter I wrote once, GetLatestColumnsFilter. It returns the columns that carry the latest timestamp, and I think it can be used to solve your problem.

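import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.exceptions.DeserializationException;
import org.apache.hadoop.hbase.filter.TimestampsFilter;

public class GetLatestColumnsFilter extends TimestampsFilter {
    // Timestamp of the first cell seen in the current row; -1 means "none yet".
    private long max;

    public GetLatestColumnsFilter() {
        super(new ArrayList<Long>());
        max = -1;
    }

    @Override
    public void reset() throws IOException {
        // Forget the previous row's timestamp so every row is judged on its own.
        max = -1;
    }

    @Override
    public ReturnCode filterKeyValue(Cell v) {
        if (-1 == max) {
            max = v.getTimestamp();
        } else if (max != v.getTimestamp()) {
            // Skip any cell whose timestamp differs from the first one seen.
            return ReturnCode.SKIP;
        }
        return ReturnCode.INCLUDE;
    }

    // Used by HBase to rebuild the filter from its serialized form; there is no state to restore.
    public static GetLatestColumnsFilter parseFrom(byte[] pbBytes) throws DeserializationException {
        return new GetLatestColumnsFilter();
    }

}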
0
votes

The official HBase guide for version 0.94 says:

Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is “delete all cells where the version is less than or equal to this version”. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values[17]. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.
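To make the "less than or equal to this version" rule concrete, here is a toy model in plain Java. It is not HBase code: TombstoneModel and visibleVersions are names I made up, and it only mimics the rule as the guide states it for a row delete.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the quoted rule; hypothetical names, not HBase code. */
final class TombstoneModel {
    private TombstoneModel() {}

    /**
     * Returns the versions that stay readable after a row tombstone written
     * at tombstoneVersion: only versions strictly newer than the marker survive.
     */
    static List<Long> visibleVersions(List<Long> versions, long tombstoneVersion) {
        List<Long> visible = new ArrayList<>();
        for (long version : versions) {
            if (version > tombstoneVersion) {
                visible.add(version);
            }
        }
        return visible;
    }
}
```

With versions 100, 200 and 300, a tombstone at 200 hides 100 and 200 and leaves 300 readable, while a tombstone at 300 or later hides the whole row. So under this rule a row-level delete removes older versions too, not just the newest one.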

So I don't see the problem with following the standard Delete procedure.

However, if you want to delete only the latest versions of your cells, you could use the setTimestamp method of the Scan class. So, what you could do is:

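// Needs java.io.IOException, java.util.*, org.apache.hadoop.hbase.Cell,
// org.apache.hadoop.hbase.CellUtil and org.apache.hadoop.hbase.client.*
List<Delete> deletes = new ArrayList<>();
Scan scan = new Scan();
scan.setTimestamp(latestVersionTimeStamp); // latestVersionTimeStamp is a long variable
// set your filters here
try (ResultScanner rscanner = table.getScanner(scan)) {
    for (Result rs : rscanner) {
        Delete d = new Delete(rs.getRow());
        // A bare Delete(row) would remove the whole row, so name each returned cell and its version.
        for (Cell cell : rs.rawCells()) {
            d.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell), cell.getTimestamp());
        }
        deletes.add(d);
    }
    table.delete(deletes);
} catch (IOException e) {
    e.printStackTrace();
}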

However, if the timestamp isn't the same across cells, the scan above will not match all of them. Setting a timestamp on each Delete probably will:

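// Needs java.io.IOException, java.util.* and org.apache.hadoop.hbase.client.*
List<Delete> deletes = new ArrayList<>();
List<Long> timestamps = new ArrayList<>(); // your list of timestamps, one per scanned row, in scan order
Scan scan = new Scan();
// set your filters here
try (ResultScanner rscanner = table.getScanner(scan)) {
    Iterator<Long> ts = timestamps.iterator();
    for (Result rs : rscanner) {
        if (!ts.hasNext()) {
            break; // no timestamp left for the remaining rows
        }
        Delete d = new Delete(rs.getRow());
        // A row-level Delete with a timestamp masks every cell in the row at or below that version.
        d.setTimestamp(ts.next());
        deletes.add(d);
    }
    table.delete(deletes);
} catch (IOException e) {
    e.printStackTrace();
}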
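Since the timestamps can differ from column to column, one way to collect them is to keep the newest timestamp seen for each column of the row. Below is a minimal sketch in plain Java, assuming the cells of one row have already been read. LatestVersions and CellVersion are hypothetical stand-ins, not the HBase Cell type. With the real client I would expect to read the cells from Result.rawCells() and hand each column and timestamp to Delete.addColumn(family, qualifier, timestamp), but I have not tested that part.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch; hypothetical names, not part of the HBase client. */
final class LatestVersions {
    private LatestVersions() {}

    /** Hypothetical stand-in for one stored version of one column. */
    static final class CellVersion {
        final String family;
        final String qualifier;
        final long timestamp;

        CellVersion(String family, String qualifier, long timestamp) {
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Maps "family:qualifier" to the newest timestamp seen for that column. */
    static Map<String, Long> latestPerColumn(List<CellVersion> cells) {
        Map<String, Long> latest = new LinkedHashMap<>();
        for (CellVersion cell : cells) {
            String column = cell.family + ":" + cell.qualifier;
            // Keep whichever timestamp is larger: the stored one or this cell's.
            latest.merge(column, cell.timestamp, Long::max);
        }
        return latest;
    }
}
```

Each entry of the returned map is one (column, timestamp) pair to delete, which is the per-column list the original question was trying to build.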

I don't guarantee this will work, however. The official guides are vague, and I might have misinterpreted something. If I did, let me know and I will delete this answer.

Where I sourced my information: the setTimestamp method of the Scan class and the setTimestamp method of the Delete class.