0 votes

I am a beginner with Hadoop and HBase, and I'm learning how to import massive data (an 8 GB TSV file stored in HDFS) into HBase using importtsv. However, the MapReduce job runs really slowly and eventually fails; maybe the file is so big that it brings the cluster down. When I switch to a small TSV file, it works fine. So how can I speed up the MapReduce job if I insist on importing such a big file? Is there any cache configuration in Hadoop that could help?

I have one macOS namenode and two Ubuntu datanodes.

The import command:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv  -Dimporttsv.columns=HBASE_ROW_KEY,info:action,timestamp,info:name,info:bank,info:account records /user/root

Error info:

2017-03-20 16:48:27,136 INFO  [main] zookeeper.ZooKeeper: Client environment:java.library.path=/usr/local/hadoop/lib/native
2017-03-20 16:48:27,136 INFO  [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/var/folders/nl/f3lktfgn7jg46jycx21cxfmr0000gn/T/
2017-03-20 16:48:27,136 INFO  [main] zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2017-03-20 16:48:27,137 INFO  [main] zookeeper.ZooKeeper: Client environment:os.name=Mac OS X
2017-03-20 16:48:27,137 INFO  [main] zookeeper.ZooKeeper: Client environment:os.arch=x86_64
2017-03-20 16:48:27,137 INFO  [main] zookeeper.ZooKeeper: Client environment:os.version=10.12.3
2017-03-20 16:48:27,137 INFO  [main] zookeeper.ZooKeeper: Client environment:user.name=haohui
2017-03-20 16:48:27,137 INFO  [main] zookeeper.ZooKeeper: Client environment:user.home=/Users/haohui
2017-03-20 16:48:27,138 INFO  [main] zookeeper.ZooKeeper: Client environment:user.dir=/Users/haohui
2017-03-20 16:48:27,138 INFO  [main] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=30000 watcher=hconnection-0x3fc2959f0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2017-03-20 16:48:27,157 INFO  [main-SendThread(master:2181)] zookeeper.ClientCnxn: Opening socket connection to server master/10.211.55.2:2181. Will not attempt to authenticate using SASL (unknown error)
2017-03-20 16:48:27,188 INFO  [main-SendThread(master:2181)] zookeeper.ClientCnxn: Socket connection established to master/10.211.55.2:2181, initiating session
2017-03-20 16:48:27,200 INFO  [main-SendThread(master:2181)] zookeeper.ClientCnxn: Session establishment complete on server master/10.211.55.2:2181, sessionid = 0x15aeae6867a0001, negotiated timeout = 30000
2017-03-20 16:48:56,396 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-03-20 16:48:56,441 INFO  [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15aeae6867a0001
2017-03-20 16:48:56,450 INFO  [main] zookeeper.ZooKeeper: Session: 0x15aeae6867a0001 closed
2017-03-20 16:48:56,450 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2017-03-20 16:48:56,524 INFO  [main] client.RMProxy: Connecting to ResourceManager at master/10.211.55.2:8032
2017-03-20 16:48:56,666 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-03-20 16:48:58,818 INFO  [main] input.FileInputFormat: Total input paths to process : 1
2017-03-20 16:48:58,873 INFO  [main] mapreduce.JobSubmitter: number of splits:56
2017-03-20 16:48:58,884 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-03-20 16:48:59,006 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1489999688045_0001
2017-03-20 16:48:59,319 INFO  [main] impl.YarnClientImpl: Submitted application application_1489999688045_0001
2017-03-20 16:48:59,370 INFO  [main] mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1489999688045_0001/
2017-03-20 16:48:59,371 INFO  [main] mapreduce.Job: Running job: job_1489999688045_0001
2017-03-20 16:49:09,668 INFO  [main] mapreduce.Job: Job job_1489999688045_0001 running in uber mode : false
2017-03-20 16:49:09,670 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2017-03-20 17:00:09,103 INFO  [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000009_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000009_0 Timed out after 600 secs
2017-03-20 17:00:09,127 INFO  [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000011_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000011_0 Timed out after 600 secs
2017-03-20 17:00:09,128 INFO  [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000010_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000010_0 Timed out after 600 secs
2017-03-20 17:00:09,129 INFO  [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000013_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000013_0 Timed out after 600 secs
2017-03-20 17:00:09,130 INFO  [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000008_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000008_0 Timed out after 600 secs
2017-03-20 17:00:09,131 INFO  [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000012_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000012_0 Timed out after 600 secs

2 Answers

0 votes

I'm not sure how to speed up your operation, as that really depends on your schema and your data; this article has some information about optimal row design. As for the crash, your job most likely throws a timeout exception because the long-running computation in the bulk step of the MapReduce job scheduled by the ImportTsv utility stops reporting progress back to YARN. You can increase the timeout in mapred-site.xml (on Hadoop 2.x the current property name is mapreduce.task.timeout; the old mapred.task.timeout still works but is deprecated):

<property>
  <name>mapred.task.timeout</name>
  <value>2000000</value> <!-- value is in milliseconds: 2000000 ms = 2000 secs -->
</property>

Alternatively, you can set it to 0, which disables the timeout for your jobs altogether, but this is considered bad practice: tasks that hang will never be killed, leaving you at risk of zombie tasks tying up your cluster.
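Since ImportTsv runs through ToolRunner, you can also override the timeout per job on the command line instead of cluster-wide in mapred-site.xml. A minimal sketch, reusing the columns and paths from the question:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dmapreduce.task.timeout=2000000 \
  -Dimporttsv.columns=HBASE_ROW_KEY,info:action,timestamp,info:name,info:bank,info:account \
  records /user/root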

0 votes

Setting mapred.task.timeout to a larger value can definitely help avoid the timeout, but the job still takes a long time to finish. I finally found a more effective way to speed up the MapReduce job and avoid the crash: increase the memory and CPU resources on all nodes.

Add to yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>4096</value>
</property>

Add to mapred-site.xml:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx3768m</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>
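Note that the new values only take effect after the YARN daemons have been restarted on every node; a minimal sketch, assuming the standard scripts under $HADOOP_HOME/sbin:

$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh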