
I wrote a MapReduce job in Python that runs via the Hadoop streaming jar, and I want to know how to use bulk loading to put its output into HBase.

I know there are 2 ways to get data into HBase by bulk loading:

  1. Generate the HFiles in the MR job, then use CompleteBulkLoad to load them into HBase.
  2. Use the ImportTsv tool, then use CompleteBulkLoad to load the data (a streaming-mapper sketch that produces ImportTsv-ready output follows this list).
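For route 2, the streaming job itself needs no HBase code: it only has to emit tab-separated lines that ImportTsv can parse. A minimal mapper sketch, assuming CSV input and a hypothetical layout of one row key plus two columns (the file name `mapper.py` and the field layout are assumptions, not from the question):

    #!/usr/bin/env python
    # mapper.py -- streaming mapper sketch: turns assumed CSV input
    # (key,value1,value2) into the tab-separated lines ImportTsv expects
    import sys

    for line in sys.stdin:
        fields = line.rstrip('\n').split(',')
        if len(fields) != 3:
            continue  # skip malformed records
        # output: rowkey \t col1-value \t col2-value
        print('\t'.join(fields))

ImportTsv would then be pointed at the job's output with `-Dimporttsv.columns=HBASE_ROW_KEY,cf1:col1,cf1:col2` (the column names here are assumptions).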

I don't know how to generate HFiles from Python in a form that fits HBase, so I tried the ImportTsv utility instead, following the instructions in this [example](http://hbase.apache.org/book.html#importtsv). But it failed with an exception:

    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/Filter...

Now I want to ask 3 questions:

  1. Can Python be used to generate HFiles through the streaming jar?
  2. How do I use ImportTsv?
  3. Can bulk loading be used to update a table in HBase? I get a file larger than 10 GB every day; could bulk loading push that file into HBase?

The Hadoop version is 2.8.0.

The HBase version is 1.2.6.

Both run in standalone mode.

Thanks for any answer.

--- update ---

ImportTsv works correctly.

But I still want to know how to generate the HFiles in an MR job via the streaming jar in Python.


1 Answer


You could try HappyBase.

    import happybase

    # Connect to the HBase Thrift server (assumed to run on localhost:9090)
    connection = happybase.Connection('localhost')
    table = connection.table('mytable')

    # Mutations are buffered and flushed every 1000 puts and when the block exits
    with table.batch(batch_size=1000) as b:
        for i in range(1200):
            b.put(b'row-%04d' % i, {
                b'cf1:col1': b'v1',
                b'cf1:col2': b'v2',
            })

As you may have imagined already, a Batch keeps all mutations in memory until the batch is sent, either by calling Batch.send() explicitly, or when the with block ends. This doesn’t work for applications that need to store huge amounts of data, since it may result in batches that are too big to send in one round-trip, or in batches that use too much memory. For these cases, the batch_size argument can be specified. The batch_size acts as a threshold: a Batch instance automatically sends all pending mutations when there are more than batch_size pending operations.
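For example, a daily multi-gigabyte file can be streamed through a batch so that only `batch_size` mutations are ever held in memory at once. A minimal sketch, assuming a tab-separated input file (`daily_dump.tsv` and the `cf1` column layout are hypothetical names, not from the question):

    import happybase

    connection = happybase.Connection('localhost')  # assumes a local Thrift server
    table = connection.table('mytable')

    # daily_dump.tsv is a hypothetical input file: rowkey \t value1 \t value2
    with open('daily_dump.tsv', 'rb') as f, table.batch(batch_size=5000) as b:
        for line in f:
            rowkey, v1, v2 = line.rstrip(b'\n').split(b'\t')
            # put() overwrites existing cells, so this also works as an update
            b.put(rowkey, {b'cf1:col1': v1, b'cf1:col2': v2})
    # leaving the with block sends any remaining pending mutations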

This needs a Thrift server running in front of HBase (it can be started with `hbase thrift start`). Just a suggestion.