1
votes

I'm writing a MapReduce job over HBase using TableMapper. I want to skip rows that don't have specific columns. For example, if the mapper reads from the "meta" family, "source" qualifier column, the mapper should be able to expect something in that column. I know I can add columns to the Scan object, but I expect this merely limits what the scan returns, not which columns must be present.

What filter can I use to skip rows without the columns I need?

Also, the filter concept itself is a little strange. Does a filter operate on a row-by-row basis or a KeyValue-by-KeyValue basis? Does "filter a row" mean skip the row, include it, or simply pass it through a filter?

Is this explained anywhere more clearly than in the HBase Javadocs?


2 Answers

2
votes
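// Filter classes are in org.apache.hadoop.hbase.filter; Bytes is in org.apache.hadoop.hbase.util
// Keep only columns whose qualifier starts with the given prefix
Filter columnFilter = new ColumnPrefixFilter(Bytes.toBytes("col-1"));
// Keep only cells whose value is not equal to the given value
Filter valueFilter = new ValueFilter(CompareFilter.CompareOp.NOT_EQUAL,
      new BinaryComparator(Bytes.toBytes("yourvalue")));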

To handle multiple column names, you can pass several column filters combined with the must-pass-all option (FilterList.Operator.MUST_PASS_ALL, which is the default).
Below is a sample with a single column filter.

// SkipFilter drops the whole row if any of its cells fails the wrapped filter
Filter avoidColumnNamesFilter = new SkipFilter(columnFilter);
scan.setFilter(avoidColumnNamesFilter);

Similarly, to skip rows based on a value, pass valueFilter to the SkipFilter.
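
To make the row-versus-cell question concrete, here is a minimal plain-Java sketch of the rule SkipFilter applies: the wrapped filter is evaluated cell by cell, and a single failing cell removes the whole row. This is only a model and does not use the HBase client API; the class `SkipFilterModel`, the method `keepRow`, and the representation of a cell as a bare qualifier string are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java model of the semantics only; it does not call the HBase client API.
class SkipFilterModel {

    // A cell is modelled as its qualifier string (hypothetical simplification).
    // cellFilter returns true when a single cell passes the wrapped filter.
    static boolean keepRow(List<String> qualifiersInRow, Predicate<String> cellFilter) {
        // SkipFilter-style rule: one failing cell drops the entire row.
        for (String qualifier : qualifiersInRow) {
            if (!cellFilter.test(qualifier)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for ColumnPrefixFilter("col-1"): a cell passes if its qualifier has the prefix.
        Predicate<String> prefixFilter = qualifier -> qualifier.startsWith("col-1");

        List<String> rowA = Arrays.asList("col-1a", "col-1b");
        List<String> rowB = Arrays.asList("col-1a", "source");

        System.out.println("rowA kept: " + keepRow(rowA, prefixFilter)); // prints "rowA kept: true"
        System.out.println("rowB kept: " + keepRow(rowB, prefixFilter)); // prints "rowB kept: false"
    }
}
```

Under this rule a row is dropped as soon as it contains any cell that fails the wrapped filter, which is a different condition from a required column being absent, so it is worth checking the result against your data.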
0
votes

The HBase book is the best place to answer a large number of questions: http://hbase.apache.org/book/client.filter.html in particular explains how filters work.

Filters are very efficient because they are applied on the server side and reduce the amount of data flowing over the network. I agree that the Javadocs make the include-or-exclude semantics non-obvious, but I think the book makes it clear: filters define what must be true in order for the row to be returned to the client.

Scans are also a good way to define what must be returned, although you need to be careful in how you define them. If you add a whole column family to a scan in one API call and then, later in your code, add a specific column qualifier from that family, the second call overrides the first: only that specific qualifier is returned, and no other qualifier in the column family is.
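
The override behaviour described above can be sketched with a small plain-Java model of a family-to-qualifiers map, where a null entry stands for "the whole family". This is an illustration of the described behaviour rather than HBase's own code; the class `ScanSelectionModel` and its `selects` method are hypothetical names, and the real Scan works on byte arrays rather than strings.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of column selection on a scan; it does not call the HBase client API.
class ScanSelectionModel {

    // family -> selected qualifiers; a null value means "every qualifier in the family".
    private final Map<String, NavigableSet<String>> familyMap = new TreeMap<>();

    ScanSelectionModel addFamily(String family) {
        familyMap.put(family, null);
        return this;
    }

    ScanSelectionModel addColumn(String family, String qualifier) {
        NavigableSet<String> qualifiers = familyMap.get(family);
        if (qualifiers == null) {
            // Replaces a previous "whole family" entry with an explicit qualifier set.
            qualifiers = new TreeSet<>();
            familyMap.put(family, qualifiers);
        }
        qualifiers.add(qualifier);
        return this;
    }

    boolean selects(String family, String qualifier) {
        if (familyMap.isEmpty()) {
            return true; // nothing specified: everything is selected
        }
        if (!familyMap.containsKey(family)) {
            return false;
        }
        NavigableSet<String> qualifiers = familyMap.get(family);
        return qualifiers == null || qualifiers.contains(qualifier);
    }

    public static void main(String[] args) {
        ScanSelectionModel scan = new ScanSelectionModel().addFamily("meta");
        System.out.println(scan.selects("meta", "other"));  // prints "true"

        scan.addColumn("meta", "source");
        System.out.println(scan.selects("meta", "source")); // prints "true"
        System.out.println(scan.selects("meta", "other"));  // prints "false"
    }
}
```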