I have two versions of a Python script that scan an HBase table 1000 rows at a time in a while loop. The first one uses happybase (see https://happybase.readthedocs.org/en/latest/user.html#retrieving-rows ):
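while variable:
    # Each pass opens a new scan that starts (inclusively) at the last key seen.
    for key, data in hbase.table(tablename).scan(row_start=new_key, batch_size=1000, limit=1000):
        print(key)
    # Remember the last key of this pass as the start of the next one.
    new_key = key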
The second one uses the HBase Thrift interface (see http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/ ):
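# Open one scanner over the whole table (empty start and stop rows).
scanner_id = hbase.scannerOpenWithStop(tablename, '', '', [])
data = hbase.scannerGetList(scanner_id, 1000)
while len(data):
    for dbpost in data:
        # Each result is a TRowResult; .row holds the row key.
        print(dbpost.row)
    # Fetch the next 1000 rows from the same scanner.
    data = hbase.scannerGetList(scanner_id, 1000)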
The row keys in the database are numbers. My problem is that at a certain row something weird happens.
happybase prints (row keys):
... 100161632382107648
10016177552
10016186396
10016200693
10016211838
100162138374537217 (point of interest)
193622937692155904
193623435597983745...
and the Thrift scanner prints (row keys):
... 100161632382107648
10016177552
10016186396
10016200693
10016211838
100162138374537217 (point of interest)
100162267416506368
10016241167
10016296927 ...
This does not happen at a batch boundary (where the next scan starts with row_start=new_key, or where the next data = scannerGetList call is made) but in the middle of a batch. And it happens every time.
I would say that the second script, the one with scannerGetList, is doing it right.
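A quick way to sanity-check that claim is to sort the printed keys as plain strings. The sketch below assumes the row keys are stored as ASCII digit strings and relies on HBase returning rows in lexicographic byte order of the row key. The key lists are copied from the two outputs above (the elided "..." parts are left out), and the names `thrift_order` and `happybase_jump` are only illustrative.

```python
# Row keys copied from the Thrift scanner output above.
thrift_order = [
    "100161632382107648",
    "10016177552",
    "10016186396",
    "10016200693",
    "10016211838",
    "100162138374537217",  # point of interest
    "100162267416506368",
    "10016241167",
    "10016296927",
]
# The two keys happybase printed right after the point of interest.
happybase_jump = ["193622937692155904", "193623435597983745"]

all_keys = thrift_order + happybase_jump

# Assumption: keys compare byte by byte as strings, not by numeric value.
lexicographic = sorted(all_keys)
numeric = sorted(all_keys, key=int)

print(lexicographic == all_keys)  # True: the listing is already in string order
print(numeric == all_keys)        # False: it is not in numeric order
```

Under those assumptions the Thrift listing is contiguous in key order, while the happybase listing jumps from 100162138374537217 straight to 193622937692155904 and skips at least the three keys that the Thrift scanner printed in between.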
Why is happybase doing it differently? Is it taking timestamps into account, or is some other happybase/HBase internal logic at work? Will it eventually scan the whole table, just in a different order?
P.S. I do know that the happybase version will scan and print the 1000th row twice, and that the scannerGetList version will ignore the first row in the next data. That is not the point; the magic happens in the middle of a 1000-row batch.
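For completeness, the repeated boundary row could be avoided with something like the sketch below. It is only an illustration of the paging logic: `fake_scan` is a hypothetical in-memory stand-in for a scan with an inclusive `row_start` and a `limit`, not the happybase API, and `scan_in_pages` is an illustrative name.

```python
import bisect


def fake_scan(sorted_keys, row_start=None, limit=None):
    """Hypothetical stand-in for a table scan: inclusive row_start, optional limit."""
    start = 0 if row_start is None else bisect.bisect_left(sorted_keys, row_start)
    stop = len(sorted_keys) if limit is None else start + limit
    return sorted_keys[start:stop]


def scan_in_pages(sorted_keys, page_size):
    """Yield every key once, restarting the scan at the last key of each page."""
    last_key = None
    while True:
        # Later pages request one extra row, because the first row repeats last_key.
        limit = page_size if last_key is None else page_size + 1
        page = fake_scan(sorted_keys, row_start=last_key, limit=limit)
        if last_key is not None:
            page = page[1:]  # drop the repeated boundary row
        if not page:
            break
        for key in page:
            yield key
        last_key = page[-1]


# Made-up keys, sorted as strings the way row keys would be.
keys = sorted(str(n) for n in range(0, 2500, 7))
paged = list(scan_in_pages(keys, page_size=100))
print(paged == keys)
```

Requesting one extra row on every page after the first keeps each page at `page_size` new rows once the duplicate is dropped.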