HBase bulk delete using MapReduce job

Question

Using a mapreduce job I am trying to delete rows from a Hbase table.

I am getting the following error.

java.lang.ClassCastException: org.apache.hadoop.hbase.client.Delete cannot be cast to org.apache.hadoop.hbase.KeyValue
        at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat$1.write(HFileOutputFormat.java:124)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:551)
        at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
        at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:99)
        at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:144)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.

It looks like this is caused by the output set to KeyValue by configureIncrementalLoad. It only has PutSortReducer and KeyValueSortReducer but not a DeleteSortReducer.

My Code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DeleteRows extends Configured implements Tool {

    public static class Map extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, Delete> {

        ImmutableBytesWritable hKey = new ImmutableBytesWritable();
        Delete delRow;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            hKey.set(value.getBytes());
            delRow = new Delete(hKey.get());
            context.write(hKey, delRow);
            // Update counters
            context.getCounter("RowsDeleted", "Success").increment(1);
        }
    }


    @SuppressWarnings("deprecation")
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        HBaseConfiguration.addHbaseResources(conf);

        Job job = new Job(conf, "Delete stuff!");
        job.setJarByClass(DeleteRows.class);

        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Delete.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        HTable hTable = new HTable(args[2]);
        // Auto configure partitioner and reducer
        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
        return (0);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new DeleteRows(), args);
        System.exit(exitCode);
    }
}

Is there a better / faster way to delete a large number of rows using their row keys? Obviously deleting each row in a mapper is possible however I would imagine that is slower than bulk pushing deletes to the correct region server.

Roman Nikitchenko Roman Nikitchenko · Accepted Answer · 2014-04-25T23:26:17

Your goal is to generate HFile with Delete stream (actually deleting marks as KeyValue) inside. And standard way to do so is to use HFileOutputFormat. Actually you can only place stream of KeyValue changes into this format and there is 2 standard reducers: PutSortReducer and KeyValueSortReducer. Setting number of reduce tasks to 0 you actually pass all Delete directly to output format which of course cannot work.

Your most obvious options:

Add your reducer DeleteSortReducer. Such reducers are pretty simple and you can just almost copy. You need only to extract individual KeyValue stream from Delete and sort them. PutSortReducer is good example for you. Put changes are not sorted so this is why such reducer is needed.
Just construct not stream of Delete but stream of appropriate KeyValue containing delete marks. This is maybe best thing for speed.

HBase bulk delete using MapReduce job

2 Answers