What Happened
All the data from last month was corrupted due to a bug in the system. So we have to delete and re-input these records manually. Basically, I want to delete all the rows inserted during a certain period of time. However, I found it difficult to scan and delete millions of rows in HBase.
Possible Solutions
I found two way to bulk delete:
The first one is to set a TTL, so that all the outdated record would be deleted automatically by the system. But I want to keep the records inserted before last month, so this solution does not work for me.
The second option is to write a client using the Java API:
public static void deleteTimeRange(String tableName, Long minTime, Long maxTime) {
Table table = null;
Connection connection = null;
try {
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
connection = HBaseOperator.getHbaseConnection();
table = connection.getTable(TableName.valueOf(tableName));
ResultScanner rs = table.getScanner(scan);
List<Delete> list = getDeleteList(rs);
if (list.size() > 0) {
table.delete(list);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
if (null != table) {
try {
table.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if (connection != null) {
try {
connection.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
private static List<Delete> getDeleteList(ResultScanner rs) {
List<Delete> list = new ArrayList<>();
try {
for (Result r : rs) {
Delete d = new Delete(r.getRow());
list.add(d);
}
} finally {
rs.close();
}
return list;
}
But in this approach, all the records are stored in ResultScanner rs
, so the heap size would be huge. And if the program crushes, it has to start from the beginning.
So, is there a better way to achieve the goal?