1 vote

I am a newbie to Hadoop and Hive. I am using Hive on Hadoop to execute queries. When I submit a query, the following log messages appear on the console:

    Hive history file=/tmp/root/hive_job_log_root_28058@hadoop2_201203062232_1076893031.txt
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004
    Kill Command = //opt/hadoop_installation/hadoop-0.20.2/bin/../bin/hadoop job -kill job_201203062223_0004
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2012-03-06 22:32:26,707 Stage-1 map = 0%, reduce = 0%
    2012-03-06 22:32:29,716 Stage-1 map = 100%, reduce = 0%
    2012-03-06 22:32:38,748 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201203062223_0004
    MapReduce Jobs Launched:
    Job 0: Map: 1  Reduce: 1  HDFS Read: 8107686  HDFS Write: 4  SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK

The "Starting Job" line above starts a Hadoop job (that is my understanding). It takes a long time for the job to start; once that line is executed, the map and reduce operations run swiftly. My questions are:

  1. Is there any way to make the launch of the Hadoop job faster? Is it possible to skip this phase?
  2. Where does the value of 'Kill Command' come from?

Please let me know if any further details are required.


3 Answers

1 vote

1) Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004

ANS: Your HQL query is translated into a Hadoop job. Hadoop then does some background work (planning resources, data locality, working out the stages needed to process the query, launch configuration, job and task ID generation, etc.), launches the mappers, performs the sort and shuffle, runs the reduce (aggregation) phase, and writes the result to HDFS.

The flow above is part of the Hadoop job life cycle, so none of it can be skipped.

At http://namenode:port/jobtracker.jsp you can see the status of your job using its job ID, job_201203062223_0004 (monitoring).
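You can also monitor the job from the command line instead of the JobTracker web UI; a sketch using the Hadoop 0.20-era `hadoop job` subcommands (the job ID is taken from the log above, and these commands need a running cluster):

```shell
# List jobs currently running on the cluster
hadoop job -list

# Show completion percentage and counters for one job
hadoop job -status job_201203062223_0004
```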

2) Kill Command = $HADOOP_HOME/bin/hadoop job -kill job_201203062223_0004

Ans: This line is printed before your mappers launch because Hadoop works on big data, and a job may take more or less time depending on the size of your dataset. If at any point you want to kill the job, this line tells you how. It is shown for every Hadoop job, and printing an informational line like this takes no noticeable time.


Some add-ons with respect to your comments:

  • Hive is not meant for low-latency jobs; immediate, real-time results are not possible. (Please check Hive's stated purposes on the Apache Hive site.)
  • The launch overhead (see Q1: Hadoop does some background work) is inherent in Hive and cannot be avoided.
  • Even for small datasets, this launch overhead is present in Hadoop.

PS: If you really need quick, near-real-time results, please look at Shark.
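One way to shrink (though not eliminate) the launch overhead for small inputs is Hive's automatic local mode, which runs small jobs in-process instead of submitting them to the cluster. A sketch; the threshold values below are illustrative, not tuned recommendations:

```sql
-- Let Hive decide to run small jobs locally instead of on the cluster
SET hive.exec.mode.local.auto=true;
-- Only use local mode when total input size is under ~128 MB ...
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
-- ... and the job reads at most 4 input files
SET hive.exec.mode.local.auto.input.files.max=4;
```

With these set, a query over a tiny table skips the JobTracker submission step entirely, which is where most of the startup delay you observed comes from.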

0 votes

First, Hive is a tool that replaces hand-written MapReduce work with HQL. In the background it has lots of predefined functions and MapReduce programs. When you run an HQL query, the Hadoop cluster does a lot of work: finding the data blocks, allocating tasks, and so on.

Second, you can kill a job with the Hadoop shell command. If your job ID is AAAAA, you can execute the command below to kill it:

$HADOOP_HOME/bin/hadoop job -kill AAAAA
0 votes

The launch of a Hadoop job can be delayed by unavailable resources. If you use YARN, you may see jobs sitting in the ACCEPTED state but not yet RUNNING. This means some other ongoing job has consumed all your cluster's containers, and the new query is waiting for capacity.

You can kill the older job with the hadoop job -kill <job_id> command, or wait for it to finish.
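On a YARN cluster, a sketch of finding and killing the blocking job with the `yarn application` CLI (the application ID below is a placeholder, and the commands require a running cluster):

```shell
# See which applications are currently holding resources
yarn application -list -appStates RUNNING

# Kill the one blocking your query (substitute the real application ID)
yarn application -kill application_1412887394565_0001
```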