I have two versions of a Python script that scan a table in HBase 1000 rows at a time in a while loop. The first one uses happybase, as in https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows

while variable:
    for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
        print key
    new_key = key

The second one uses the HBase Thrift interface, as in http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/

scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
data = hbase.scannerGetList(scanner_id, 1000) 
while len(data):
    for dbpost in data:
        print row_of_dbpost
    data = hbase.scannerGetList(scanner_id, 1000)

The row keys in the database are numbers. My problem is that at a certain row something weird happens:

happybase prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest) 
193622937692155904 
193623435597983745...

and the Thrift scanner prints (rows):

... 100161632382107648 
10016177552 
10016186396 
10016200693 
10016211838 
100162138374537217 (point of interest)
100162267416506368 
10016241167 
10016296927 ...

This does not happen at the boundary between one batch of 1000 rows and the next (the next scan with row_start, or the next data = scannerGetList call), but in the middle of a batch. And it happens every time.

I would say that the second script, with scannerGetList, is doing it right.

Why is happybase doing it differently? Is it considering timestamps or some other internal happybase/HBase logic? Will it eventually scan the whole table, just in a different order?

P.S. I do know that the happybase version will scan and print the 1000th row twice, and that scannerGetList will ignore the first row in the next data. That is not the point; the magic is happening in the middle of a 1000-row batch.

1 Answer

I'm not sure about your data, but those loops are not identical. Your Thrift version uses only a single scanner, while your Happybase example repeatedly creates a new scanner. Also, your Happybase version imposes a scanner limit, while your Thrift version does not.
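As a side note on the data: HBase orders rows by the raw bytes of the row key, not by numeric value, so numeric-looking keys of different lengths interleave. A small sketch in plain Python, using only the keys quoted in the question, suggests that the Thrift output is simply in that byte-wise order. It does not explain why the Happybase loop skips ahead; it only shows that the order itself is not the anomaly.

```python
# Row keys copied from the Thrift output in the question.
thrift_keys = [
    "100161632382107648",
    "10016177552",
    "10016186396",
    "10016200693",
    "10016211838",
    "100162138374537217",
    "100162267416506368",
    "10016241167",
    "10016296927",
]

# Plain string comparison of ASCII digits matches byte-wise ordering.
is_lexicographic = sorted(thrift_keys) == thrift_keys

# Sorting by numeric value would put the 11-digit keys first instead.
is_numeric = sorted(thrift_keys, key=int) == thrift_keys

print(is_lexicographic, is_numeric)
```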

With Thrift you need to do bookkeeping, and you will need duplicate code (the scannerGetList() call) for the loop, so perhaps that's causing your confusion.
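One way to keep that bookkeeping in a single place is to wrap the scanner in a generator, roughly as sketched below. The helper name iter_rows is hypothetical, and the four-argument scannerOpenWithStop call mirrors the question (some HBase versions expect an extra attributes argument). FakeClient and FakeResult are stand-ins so the sketch runs without a cluster; a real Thrift client would be passed in instead.

```python
from collections import namedtuple


def iter_rows(client, table_name, batch_size=1000):
    """Yield (row_key, columns) from one scanner and always close it."""
    scanner_id = client.scannerOpenWithStop(table_name, '', '', [])
    try:
        while True:
            batch = client.scannerGetList(scanner_id, batch_size)
            if not batch:
                break  # an empty batch means the scan is exhausted
            for result in batch:
                yield result.row, result.columns
    finally:
        # Runs on normal exhaustion and when the caller closes the generator.
        client.scannerClose(scanner_id)


# --- Stand-ins for demonstration only (not part of any HBase library) ---
FakeResult = namedtuple("FakeResult", ["row", "columns"])


class FakeClient:
    def __init__(self, keys):
        self._keys = list(keys)
        self._pos = 0
        self.closed = False

    def scannerOpenWithStop(self, table, start, stop, columns):
        return 1  # dummy scanner id

    def scannerGetList(self, scanner_id, n):
        batch = self._keys[self._pos:self._pos + n]
        self._pos += len(batch)
        return [FakeResult(k, {}) for k in batch]

    def scannerClose(self, scanner_id):
        self.closed = True


client = FakeClient(["a", "b", "c", "d", "e"])
keys = [key for key, _ in iter_rows(client, "t", batch_size=2)]
print(keys, client.closed)
```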

The right approach with Happybase would simply be this:

table = connection.table(tablename)
for key, data in table.scan(row_start=new_key, batch_size=1000):
    print key
    if some_condition:
        break  # this will cleanly close the scanner

Note: no nested loops here. Another benefit is that Happybase will properly close the scanner when you're done with it, while your Thrift version does not.
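The clean-up on break relies on ordinary Python generator behaviour: closing a generator runs its finally block. The following illustrates that mechanism with a hypothetical stand-in (fake_scan is not Happybase code), on the assumption that the library releases its scanner in a similar way. contextlib.closing is used here only to make the moment of closing explicit.

```python
from contextlib import closing


def fake_scan(rows, log):
    """Stand-in for a scanner-backed generator (hypothetical)."""
    log.append("open")
    try:
        for row in rows:
            yield row
    finally:
        # Reached when the generator is exhausted or closed early.
        log.append("close")


log = []
seen = []
with closing(fake_scan(["a", "b", "c"], log)) as scanner:
    for key in scanner:
        seen.append(key)
        if key == "b":
            break  # leaving the with-block closes the generator

print(seen, log)
```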