1
votes

Problem: A regular expression is not working as expected in an HBase scan filter. The regex is accepted without any error, but the scan does not return only the filtered rows.

Background: We store our data in HBase as strings (I know it should have been Avro, but we need to work with this for now).

Our HBase column data rows look something like the following; the pipe character is used as the delimiter.

NAME|10000081|10000102|13513|10102026|GENDER|ID
NAME|10000081|10000101|13513|10102026|GENDER|ID
NAME|10000081|10000103|13513|10102026|GENDER|ID
NAME|10000082|10000104|13515|10102026|GENDER|ID
NAME|10000082|10000104|13516|10102026|GENDER|ID

I am writing a RegEx filter for the HBase scanner which will scan these rows.

My RegEx string looks like this :

^NAME\\|.*\\|.*\\|.*\\|.*\\|.*\\|.*$

This is the input for an HBase QualifierFilter, e.g.:

Filter qfilter = new QualifierFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator(regexString.toString()));

With the regex string below, I want to keep only rows where Name=RECKO and the 3rd column = 10000101, but the scan returns all rows.

Regex String = ^NAME\\|.*\\|10000101\\|.*\\|.*\\|.*\\|.*$

What is wrong with my regular expression? Any pointers/suggestions are appreciated very much.

Test Program:

public class RegEx1 {
    public static void main(String[] args) {
        // Sample rows; columns are separated by the pipe character.
        String[] rows = {
            "PC|10000081|10000102|13513|10102026|LOC|ic",
            "PC|10000081|10000101|13512|10102025|LOC|zc",
            "NAME|10000042|10000084|13576|10101626|GENDER|cc",
            "NAME|10000042|10000084|13576|10101626|GENDER|za",
            "NAME|10000042|10000084|13576|10101626|GENDER|zc",
            "NAME|10000061|10000086|13581|10101630|GENDER|ic",
            "NAME|10000061|10000086|13581|10101630|GENDER|za",
            "NAME|10000061|10000086|13581|10101630|GENDER|zc",
            "NAME|10001076|10001744|15106|10123669|GENDER|cc",
            "NAME|10001076|10001744|15106|10123669|GENDER|za",
            "NAME|10001076|10001744|15106|10123669|GENDER|zc",
            "NAME|10000061|10000086|13581|10101630|GENDER|ic",
            "NAME|10000061|10000086|13581|10101630|GENDER|za",
            "NAME|10000061|10000086|13581|10101630|GENDER|zc",
            "NAME|10001075|10001743|15105|10123664|GENDER|ic",
            "NAME|10001075|10001743|15105|10123664|GENDER|za",
            "NAME|10001075|10001743|15105|10123664|GENDER|zc",
            "NAME|10001077|10001745|15239|10123673|GENDER|cc",
            "NAME|10001077|10001745|15239|10123673|GENDER|za",
            "NAME|10001077|10001745|15239|10123673|GENDER|zc",
            "NAME|10002165|10000102|10151364|10151363|GENDER|ic",
            "NAME|10002165|10003668|10151364|10151363|GENDER|za",
            "NAME|10002165|10003668|10151364|10151363|GENDER|zc",
            "NAME|10002167|10003670|10151368|10151367|GENDER|cc",
            "NAME|10002167|10003670|10151368|10151367|GENDER|zb"
        };

        // Print each row followed by whether the whole row matches the regex.
        for (String s : rows) {
            System.out.println(s);
            System.out.println(s.matches("^NAME\\|10002167\\|.*\\|.*\\|.*\\|*$"));
        }
    }
}

With the above program I get every input value reported as a match; it should match only strings where the first column is "NAME" and the second column is 10002167.

Update: Thanks to @Aviram Segal. After correcting the regex, it works in the Java test program but still not in the HBase scan filter.

2
What is the problem? What's the output? - Michael Myers
It doesn't work as expected. It returns all rows. It works only when the first regex value is given. - user300313

2 Answers

2
votes

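You forgot to escape one | character, so it is treated as an OR (alternation). That splits the pattern into two alternatives, and the second one matches any row that contains enough pipes. You can also use [|] instead of \\|; personally I like that style better.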

Yours: ^NAME\\|.*\\|10000101|.*\\|.*\\|.*\\|.*$

Fixed: ^NAME\\|.*\\|10000101\\|.*\\|.*\\|.*\\|.*$


Yours: System.out.println(s.matches("^NAME\\|10002167|.*\\|.*\\|.*\\|*$"));

Fixed: System.out.println(s.matches("^NAME\\|10002167\\|.*\\|.*\\|.*\\|*$"));
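
To see the effect of the missing escape in isolation, here is a minimal sketch using plain java.util.regex (no HBase involved). The class name PipeAlternationDemo and the helper matches are made up for illustration; the two sample rows are taken from the question.

```java
import java.util.regex.Pattern;

final class PipeAlternationDemo {
    // Unescaped '|' after 10000101: the engine reads this as two alternatives,
    //   ^NAME\|.*\|10000101   OR   .*\|.*\|.*\|.*$
    static final Pattern BROKEN =
            Pattern.compile("^NAME\\|.*\\|10000101|.*\\|.*\\|.*\\|.*$");

    // Every '|' escaped: one pattern with literal pipe delimiters.
    static final Pattern FIXED =
            Pattern.compile("^NAME\\|.*\\|10000101\\|.*\\|.*\\|.*\\|.*$");

    // Whole-string match, the same semantics as String.matches().
    static boolean matches(Pattern pattern, String row) {
        return pattern.matcher(row).matches();
    }

    public static void main(String[] args) {
        String wanted = "NAME|10000081|10000101|13513|10102026|GENDER|ID";
        String unwanted = "NAME|10000082|10000104|13515|10102026|GENDER|ID";

        System.out.println(matches(BROKEN, wanted));   // true
        System.out.println(matches(BROKEN, unwanted)); // true (second alternative matches)
        System.out.println(matches(FIXED, wanted));    // true
        System.out.println(matches(FIXED, unwanted));  // false
    }
}
```

The broken pattern accepts the unwanted row because its second alternative only requires three pipes somewhere in the string, which every row in the sample data has.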

0
votes

A "." matches any character, including "|", hence the issue. Try using \w (word characters) in place of ".".
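
One way to read this suggestion: because .* can also run across | characters, the column position is only pinned down when a row has exactly the expected number of delimiters. The following is a rough sketch, assuming every field consists of word characters only (true for the sample rows, not verified for the real data). The class name StrictColumnDemo and the 8-column row are invented for illustration.

```java
import java.util.regex.Pattern;

final class StrictColumnDemo {
    // ".*" may swallow pipes, so "10000101" is not tied to the 3rd column.
    static final Pattern LOOSE =
            Pattern.compile("^NAME\\|.*\\|10000101\\|.*\\|.*\\|.*\\|.*$");

    // "\\w+" cannot cross a pipe, so each field stays inside its own column.
    static final Pattern STRICT =
            Pattern.compile("^NAME\\|\\w+\\|10000101\\|\\w+\\|\\w+\\|\\w+\\|\\w+$");

    // Whole-string match, the same semantics as String.matches().
    static boolean matches(Pattern pattern, String row) {
        return pattern.matcher(row).matches();
    }

    public static void main(String[] args) {
        // Row from the question: 7 columns, 10000101 in the 3rd column.
        String sevenColumns = "NAME|10000081|10000101|13513|10102026|GENDER|ID";
        // Hypothetical malformed row: 8 columns, 10000101 in the 4th column.
        String eightColumns = "NAME|x|y|10000101|13513|10102026|GENDER|ID";

        System.out.println(matches(LOOSE, sevenColumns));  // true
        System.out.println(matches(LOOSE, eightColumns));  // true (".*" matched "x|y")
        System.out.println(matches(STRICT, sevenColumns)); // true
        System.out.println(matches(STRICT, eightColumns)); // false
    }
}
```

If fields can contain characters other than letters, digits and underscores, [^|]* would be the looser alternative to \w+ with the same "do not cross a pipe" property.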