
I was using the HBase completebulkload tool to load the output of ImportTsv into an HBase table, and I noticed that it copies the output files instead of moving them. This takes a long time for my gigabytes of data.

In the HBase documentation (http://hbase.apache.org/book/ops_mgt.html#completebulkload) I read that the files would be moved, not copied. Can anyone help me with this?

I use HBase 0.94.11 and Hadoop 1.2.1. The bulk load output directory and the HBase cluster are on the same file system.

I've also written a MapReduce job using HFileOutputFormat. When I use LoadIncrementalHFiles to move the output of my job into the HBase table, it still copies the files instead of moving them.

Kind Regards

1 Answer


I found the following lines in the region server log. They show why the files are copied instead of moved:

Region Server Log

File hdfs://master.mydomain/user/cluster/mbe/output/fam/8a6f322894784c9c9802e5b295025ee0 on different filesystem than destination store - moving to this filesystem.
Copied to temporary path on dst filesystem: hdfs://master.mydomain:8020/hbase/MBE/fd9eab14bf12d1b44ea77aa3d1fc1b31/.tmp/d63966b6d5fa487f88426552d1ca43f4
Moved hfile hdfs://master.mydomain:8020/hbase/MBE/fd9eab14bf12d1b44ea77aa3d1fc1b31/.tmp/d63966b6d5fa487f88426552d1ca43f4 into store directory hdfs://master.mydomain:8020/hbase/MBE/fd9eab14bf12d1b44ea77aa3d1fc1b31/fam - updating store file list.

Solution

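The log shows that HBase treats the source and destination store files as being on different file systems, even though both are on the same HDFS. The only visible difference is in the URIs: the source (hdfs://master.mydomain/...) has no port number, while the destination (hdfs://master.mydomain:8020/...) includes port 8020.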
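To illustrate how the two URIs from the log can look like different file systems, here is a minimal Java sketch. It is an assumption about the kind of comparison involved, not the actual HBase code: the class FsUriCheck and the method sameFileSystem are hypothetical names, and the check simply compares scheme and authority (host plus optional port) as strings.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical illustration only; this is not HBase's implementation.
class FsUriCheck {

    // Treats two paths as being on the same file system only when both
    // the scheme and the authority (host[:port]) match exactly.
    public static boolean sameFileSystem(String a, String b) {
        URI ua = URI.create(a);
        URI ub = URI.create(b);
        return Objects.equals(ua.getScheme(), ub.getScheme())
                && Objects.equals(ua.getAuthority(), ub.getAuthority());
    }

    public static void main(String[] args) {
        // Source without a port vs. destination with port 8020, as in the log.
        System.out.println(sameFileSystem(
                "hdfs://master.mydomain/user/cluster/mbe/output",
                "hdfs://master.mydomain:8020/hbase/MBE")); // prints false

        // Both sides written with the port.
        System.out.println(sameFileSystem(
                "hdfs://master.mydomain:8020/user/cluster/mbe/output",
                "hdfs://master.mydomain:8020/hbase/MBE")); // prints true
    }
}
```

With a strict comparison like this, "master.mydomain" and "master.mydomain:8020" count as different authorities even when both point at the same NameNode.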

When I use "hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://master.mydomain:8020/user/cluster/mbe/output MBE" instead of "hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles mbe/output MBE", the files are moved and the issue is resolved.

In short, the problem was solved by passing a fully qualified HDFS URI that includes the port number instead of a relative path.
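If you build the bulk load path in code, a small helper can turn the relative path into the fully qualified form before it is passed on. This is only a sketch using java.net.URI: the class QualifyPath and the method qualify are hypothetical names, and the default file system URI (hdfs://master.mydomain:8020) and home directory (/user/cluster) are taken from the log above and must be adjusted for your cluster.

```java
import java.net.URI;

// Hypothetical helper; names and default values are assumptions.
class QualifyPath {

    // Resolves a relative path against <defaultFs><homeDir>/ so the result
    // carries scheme, host and port. An already absolute URI is returned as is.
    public static String qualify(String defaultFs, String homeDir, String relative) {
        String base = defaultFs + (homeDir.endsWith("/") ? homeDir : homeDir + "/");
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        System.out.println(qualify(
                "hdfs://master.mydomain:8020", "/user/cluster", "mbe/output"));
        // prints hdfs://master.mydomain:8020/user/cluster/mbe/output
    }
}
```

The resulting string is the same path that is given to LoadIncrementalHFiles in the working command.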

For more details, see https://issues.apache.org/jira/browse/HBASE-9537