I have written a MapReduce job using Python streaming (with only the mapper function implemented) and I use happybase to read from HBase. When I run this job on a 5-node cluster, the mapper code contains a scan that reads records from HBase, and since that same code is distributed across the cluster, every mapper instance ends up processing the same data set read from HBase.

example:

    for key, data in table.scan(row_start='1'):
        Somecompute(key, data)

Here, if I have 100 rows in HBase, all the mapper instances spawned in the cluster process the same 100 records, since each of them executes the same mapper code; hence the duplication. My requirement is that mapper m1 should process records 1 to 20, m2 records 21 to 40, m3 records 41 to 60, and so on. How can I achieve this in Python streaming using happybase? Could anyone please help. Thanks!!
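
What I am imagining is something like the bounded scan below, where each mapper only scans its own slice. This is only a sketch: the MAPPER_ROW_START / MAPPER_ROW_STOP environment variables, the host name, and the table name are made-up placeholders, and how each mapper would actually learn its range is exactly what I am asking.

    import os
    import happybase

    # Hypothetical: suppose the job hands each mapper its own key range
    # through environment variables (these variable names are made up).
    row_start = os.environ['MAPPER_ROW_START'].encode()
    row_stop = os.environ['MAPPER_ROW_STOP'].encode()

    connection = happybase.Connection('hbase-host')  # placeholder host
    table = connection.table('mytable')              # placeholder table name

    # row_stop is exclusive, so the mappers' slices would not overlap
    for key, data in table.scan(row_start=row_start, row_stop=row_stop):
        Somecompute(key, data)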


1 Answer


In happybase, scan's row_start parameter decides which row key the scan starts from, so if the start rows are the same, the result sets will be the same.

If you want to get the next set, you must set row_start to the last row key of the previous result, like below:

1. 1st scan: row_start=1, result=[1:101], last_row=101, and Somecompute(1 ~ 100)
2. 2nd scan: row_start=101, result=[101:201], last_row=201, and Somecompute(101 ~ 200)
3. 3rd scan: ...
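
As a rough sketch of that loop (the connection host 'hbase-host' and table name 'mytable' are placeholders, Somecompute is the function from the question, and scan's limit parameter caps how many rows each pass returns):

    import happybase

    connection = happybase.Connection('hbase-host')  # placeholder host
    table = connection.table('mytable')              # placeholder table name

    PAGE_SIZE = 100
    row_start = None          # None means scan from the first row

    while True:
        # After the first page, fetch one extra row, because row_start
        # is inclusive and the overlapping row is skipped below.
        limit = PAGE_SIZE if row_start is None else PAGE_SIZE + 1
        count = 0
        last_key = None
        for key, data in table.scan(row_start=row_start, limit=limit):
            if key == row_start:
                continue      # already processed at the end of the last page
            Somecompute(key, data)
            last_key = key
            count += 1
        if count < PAGE_SIZE:
            break             # short page: reached the end of the table
        row_start = last_key  # next scan starts at the last processed row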

I hope it will be helpful.