HBase scanning through shell command and MapReduce gives two different results

Question

I have an HBase table with more than a billion records. When I scan it from the HBase shell with a ValueFilter, I get 41820 records, but the scan takes more than 35 minutes. When I scan the same table with a MapReduce program, I get the count within 2 minutes, but it is 41035 records. I don't know why the two counts differ.

Here is the shell command I use:

scan 'permhistory', { COLUMNS => 'h:e_source', FILTER => "ValueFilter( =, 'binaryprefix:AC_B2B' )" }

Result: 41820

Here is the Scan object in the MapReduce job:

    Scan scan = new Scan();
    scan.setCaching(2000);
    scan.setCacheBlocks(false);
    scan.addFamily(Bytes.toBytes("h"));
    scan.addColumn(Bytes.toBytes("h"), Bytes.toBytes("e_source"));
    SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("h"),
                    Bytes.toBytes("e_source"),CompareOp.EQUAL,Bytes.toBytes("AC_B2B"));
    filter.setLatestVersionOnly(false);
    scan.setFilter(filter);

Any idea? This is my first post here. Experts out there, would you please help me out? I am kind of stuck on automating our system.

Averman Averman · Accepted Answer · 2014-10-03T06:30:45

In the MapReduce job you are using this constructor:

public SingleColumnValueFilter(byte[] family,
                       byte[] qualifier,
                       CompareFilter.CompareOp compareOp,
                       byte[] value)

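This means you instantiate the filter with the default comparator, a BinaryComparator, which with EQUAL matches only cells whose whole value is exactly AC_B2B. In the HBase shell, however, you're using

"ValueFilter( =, 'binaryprefix:AC_B2B' )"

which is a binaryprefix comparator: it matches any value that starts with AC_B2B. To get the same behaviour in MapReduce, you should try this instead: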

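// Needs: import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
// Prefix match on the cell value, mirroring 'binaryprefix:AC_B2B' in the shell filter.
SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("h"),
                    Bytes.toBytes("e_source"),
                    CompareOp.EQUAL,
                    new BinaryPrefixComparator(Bytes.toBytes("AC_B2B")));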
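To make the comparator difference concrete, here is a minimal plain-Java sketch. It does not call HBase at all; it only mimics what an exact binary comparison and a binary prefix comparison do on cell bytes. The class name, method names, and sample values (such as AC_B2B_EU) are made up for illustration and are not from your table. If some of your cells hold values that merely start with AC_B2B, a difference like this could account for the gap between the two counts.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the two comparison semantics without using HBase.
class ComparatorSemantics {

    // Exact comparison: the cell value must equal the target byte for byte.
    static boolean exactMatch(byte[] cell, byte[] target) {
        return Arrays.equals(cell, target);
    }

    // Prefix comparison: only the first prefix.length bytes of the cell are compared.
    static boolean prefixMatch(byte[] cell, byte[] prefix) {
        if (cell.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (cell[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Counts how many cell values match the target under the chosen semantics.
    static long countMatches(List<String> cells, String target, boolean prefixOnly) {
        byte[] t = target.getBytes(StandardCharsets.UTF_8);
        long n = 0;
        for (String c : cells) {
            byte[] b = c.getBytes(StandardCharsets.UTF_8);
            if (prefixOnly ? prefixMatch(b, t) : exactMatch(b, t)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical cell values, not taken from the real table.
        List<String> cells = Arrays.asList("AC_B2B", "AC_B2B_EU", "AC_B2C", "AC_B2B");
        System.out.println("exact:  " + countMatches(cells, "AC_B2B", false)); // 2
        System.out.println("prefix: " + countMatches(cells, "AC_B2B", true));  // 3
    }
}
```

The prefix count can never be smaller than the exact count, which is consistent with the shell returning more records than the MapReduce job.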

Moreover, in the HBase shell you are using ValueFilter, while in the MapReduce job you are using SingleColumnValueFilter. For your reference, here is what the HBase documentation says about each:

SingleColumnValueFilter

This filter is used to filter cells based on value. It takes a CompareFilter.CompareOp operator (equal, greater, not equal, etc), and either a byte [] value or a ByteArrayComparable. If we have a byte [] value then we just do a lexicographic compare. For example, if passed value is 'b' and cell has 'a' and the compare operator is LESS, then we will filter out this cell (return true). If this is not sufficient (eg you want to deserialize a long and then compare it to a fixed long value), then you can pass in your own comparator instead.

You must also specify a family and qualifier. Only the value of this column will be tested. When using this filter on a Scan with specified inputs, the column to be tested should also be added as input (otherwise the filter will regard the column as missing).

To prevent the entire row from being emitted if the column is not found on a row, use setFilterIfMissing(boolean). Otherwise, if the column is found, the entire row will be emitted only if the value passes. If the value fails, the row will be filtered out.

ValueFilter

This filter is used to filter based on column value. It takes an operator (equal, greater, not equal, etc) and a byte [] comparator for the cell value.

In this case, since you've explicitly set the column to be scanned, the two will act the same way. In general, ValueFilter tests the cells of every column and keeps only the cells that pass, while SingleColumnValueFilter tests only the specified column and omits the whole row if that column's value doesn't pass the filter.
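The cell-level versus row-level distinction can be sketched in plain Java as well. Again, this is only a model of the behaviour described in the documentation quoted above, not HBase code: a row is represented as a map from column name to value, Strings stand in for byte arrays, and all names and sample data are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: models filter scope, not the real HBase classes.
class FilterScopeSketch {

    // ValueFilter-like: every cell is tested, and only matching cells are kept.
    static Map<String, String> valueFilterLike(Map<String, String> row, String prefix) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            if (cell.getValue().startsWith(prefix)) {
                kept.put(cell.getKey(), cell.getValue());
            }
        }
        return kept;
    }

    // SingleColumnValueFilter-like: one column decides the fate of the whole row.
    // If the column is missing, the row is emitted unless filterIfMissing is true.
    static Map<String, String> singleColumnLike(Map<String, String> row, String column,
                                                String prefix, boolean filterIfMissing) {
        String value = row.get(column);
        if (value == null) {
            return filterIfMissing ? new LinkedHashMap<>() : row;
        }
        return value.startsWith(prefix) ? row : new LinkedHashMap<>();
    }

    public static void main(String[] args) {
        // Hypothetical row with two columns in family h.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("h:e_source", "AC_B2B_EU");
        row.put("h:other", "xyz");

        System.out.println(valueFilterLike(row, "AC_B2B"));                       // only h:e_source
        System.out.println(singleColumnLike(row, "h:e_source", "AC_B2B", false)); // the whole row
    }
}
```

When the scan is restricted to a single column, as in your case, each row contains at most that one cell, so both models return the same rows.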
