1
votes

I have a 6-node DataStax Cassandra cluster (3 Cassandra nodes, 3 analytics nodes). I am using Hive to generate reports.

The issue is that when I run a Hive job with a count(*) or group by query, the mappers complete but the reducer gets stuck at a particular percentage and never moves.

(All the nodes hold 60 GB of data. The column family I am running Hive against has around 40 GB on each node and 15 columns, each holding values of roughly 50-60 characters. I am using the default Hive settings.)
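For reference, the reducer-related properties I have not touched can be checked from the Hive CLI; running set with just a property name prints its current value (the names below are the standard Hive/Hadoop ones, nothing specific to my setup):

    -- run inside the hive shell; "set <name>;" with no value prints the current setting
    set mapred.reduce.tasks;                   -- number of reduce tasks (-1 = decided by Hive)
    set hive.exec.reducers.bytes.per.reducer;  -- input bytes each reducer should handle
    set hive.exec.reducers.max;                -- hard cap on the number of reducers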

Also, is there any information available on time taken vs. data size for a Hive job (a normal query)? I could not find any on the net.

This is what happens:

The query is very simple: select count(*) from table; and the table has around 300k rows.

Starting Job = job_201305210728_0005, Tracking URL = http://something:50030/jobdetails.jsp?jobid=job_201305210728_0005
Kill Command = ./bin/dse hadoop job  -Dmapred.job.tracker=10.70.41.62:8012 -kill job_201305210728_0005
Hadoop job information for Stage-1: number of mappers: 7; number of reducers: 1
2013-05-22 03:52:58,811 Stage-1 map = 0%,  reduce = 0%
2013-05-22 03:53:02,841 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 1.87 sec
2013-05-22 03:53:03,863 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 1.87 sec
2013-05-22 03:53:04,874 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 2.59 sec
2013-05-22 03:53:05,882 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 2.59 sec
2013-05-22 03:53:06,894 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 3.42 sec
2013-05-22 03:53:07,901 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 3.42 sec
2013-05-22 03:53:08,908 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 5.13 sec
2013-05-22 03:53:09,919 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 5.13 sec
2013-05-22 03:53:10,929 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU 5.51 sec
2013-05-22 03:53:11,936 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 7.79 sec
2013-05-22 03:53:12,943 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 7.79 sec
2013-05-22 03:53:13,951 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 7.79 sec
2013-05-22 03:53:14,958 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 10.25 sec
2013-05-22 03:53:15,965 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 10.25 sec
2013-05-22 03:53:16,972 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 10.25 sec
2013-05-22 03:53:17,979 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:18,986 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:19,993 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:20,999 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:22,007 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:23,014 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:24,021 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:25,028 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:26,035 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:27,042 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:28,049 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:29,055 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:30,063 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:31,070 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:32,077 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:33,084 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:34,091 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:35,097 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:36,105 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:37,112 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:38,118 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:39,125 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:40,132 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:41,139 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:42,145 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:43,153 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:44,160 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:45,166 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:46,172 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:47,180 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:48,186 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:49,193 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:50,201 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:51,208 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:52,214 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:53,221 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:54,227 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:55,235 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:56,241 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:57,249 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec
2013-05-22 03:53:58,256 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 10.25 sec

and it continues like this indefinitely.

I am attaching the Hive logs:

 Counters plan="{"queryId":"root_20130523051919_5e968691-9119-47fa-9b45-f61a90e86452","queryType":null,"queryAttributes":{"queryString":"select count(*) from suborder_issue"},"queryCounters":"null","stageGraph":{"nodeType":"STAGE","roots":"null","adjacencyList":"]"},"stageList":[{"stageId":"Stage-1","stageType":"MAPRED","stageAttributes":"null","stageCounters":{"CNTR_NAME_Stage-1_REDUCE_PROGRESS":"14","CNTR_NAME_Stage-1_MAP_PROGRESS":"100"},"taskList":[{"taskId":"Stage-1_MAP","taskType":"MAP","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"TS_0","children":["SEL_1"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_1","children":["GBY_2"],"adjacencyType":"CONJUNCTIVE"},{"node":"GBY_2","children":["RS_3"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"TS_0","operatorType":"TABLESCAN","operatorAttributes":"null","operatorCounters":"}","done":"true","started":"true"},{"operatorId":"SEL_1","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"GBY_2","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"RS_3","operatorType":"REDUCESINK","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"}],"done":"true","started":"true"},{"taskId":"Stage-1_REDUCE","taskType":"REDUCE","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"GBY_4","children":["SEL_5"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_5","children":["FS_6"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"GBY_4","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"}","done":"false","started":"true"},{"operatorId":"SEL_5","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"},{"operatorId":"FS_6","operatorType":"FILESINK","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}" TIME="1369303703813"
TaskProgress TASK_HADOOP_PROGRESS="2013-05-23 06:08:24,818 Stage-1 map = 100%,  reduce = 14%, Cumulative CPU 23.04 sec" TASK_NUM_REDUCERS="1" TASK_NAME="org.apache.hadoop.hive.ql.exec.MapRedTask" TASK_NUM_MAPPERS="7" TASK_COUNTERS="org.apache.hadoop.hive.ql.exec.Operator$ProgressCounter.CREATED_FILES:0,Job Counters .Launched reduce tasks:1,Job Counters .SLOTS_MILLIS_MAPS:41820,Job Counters .Rack-local map tasks:8,Job Counters .Launched map tasks:8,File Input Format Counters .Bytes Read:0,FileSystemCounters.FILE_BYTES_WRITTEN:327828,FileSystemCounters.CFS_BYTES_READ:3527,org.apache.hadoop.hive.ql.exec.MapOperator$Counter.DESERIALIZE_ERRORS:0,Map-Reduce Framework.Map output materialized bytes:203,Map-Reduce Framework.Map input records:351783,Map-Reduce Framework.Reduce shuffle bytes:78,Map-Reduce Framework.Spilled Records:6,Map-Reduce Framework.Map output bytes:79,Map-Reduce Framework.Total committed heap usage (bytes):2057568256,Map-Reduce Framework.CPU time spent (ms):23040,Map-Reduce Framework.Map input bytes:351783,Map-Reduce Framework.SPLIT_RAW_BYTES:3527,Map-Reduce Framework.Combine input records:0,Map-Reduce Framework.Reduce input records:0,Map-Reduce Framework.Reduce input groups:0,Map-Reduce Framework.Combine output records:0,Map-Reduce Framework.Physical memory (bytes) snapshot:2163638272,Map-Reduce Framework.Reduce output records:0,Map-Reduce Framework.Virtual memory (bytes) snapshot:7556251648,Map-Reduce Framework.Map output records:6" TASK_ID="Stage-1" QUERY_ID="root_20130523051919_5e968691-9119-47fa-9b45-f61a90e86452" TASK_HADOOP_ID="job_201305220529_0009" TIME="1369303704818"
Counters plan="{"queryId":"root_20130523051919_5e968691-9119-47fa-9b45-f61a90e86452","queryType":null,"queryAttributes":{"queryString":"select count(*) from suborder_issue"},"queryCounters":"null","stageGraph":{"nodeType":"STAGE","roots":"null","adjacencyList":"]"},"stageList":[{"stageId":"Stage-1","stageType":"MAPRED","stageAttributes":"null","stageCounters":{"CNTR_NAME_Stage-1_REDUCE_PROGRESS":"14","CNTR_NAME_Stage-1_MAP_PROGRESS":"100"},"taskList":[{"taskId":"Stage-1_MAP","taskType":"MAP","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"TS_0","children":["SEL_1"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_1","children":["GBY_2"],"adjacencyType":"CONJUNCTIVE"},{"node":"GBY_2","children":["RS_3"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"TS_0","operatorType":"TABLESCAN","operatorAttributes":"null","operatorCounters":"}","done":"true","started":"true"},{"operatorId":"SEL_1","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"GBY_2","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"},{"operatorId":"RS_3","operatorType":"REDUCESINK","operatorAttributes":"null","operatorCounters":"null","done":"true","started":"true"}],"done":"true","started":"true"},{"taskId":"Stage-1_REDUCE","taskType":"REDUCE","taskAttributes":"null","taskCounters":"null","operatorGraph":{"nodeType":"OPERATOR","roots":"null","adjacencyList":[{"node":"GBY_4","children":["SEL_5"],"adjacencyType":"CONJUNCTIVE"},{"node":"SEL_5","children":["FS_6"],"adjacencyType":"CONJUNCTIVE"}]},"operatorList":[{"operatorId":"GBY_4","operatorType":"GROUPBY","operatorAttributes":"null","operatorCounters":"}","done":"false","started":"true"},{"operatorId":"SEL_5","operatorType":"SELECT","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"},{"operatorId":"FS_6","operatorType":"FILESINK","operatorAttributes":"null","operatorCounters":"null","done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}],"done":"false","started":"true"}" TIME="1369303704818"
2 comments

Could you share examples of the actual queries that got stuck in the past? That would help in finding a failure signature. (Lukas Vermeer)

I have added the query and job details. (Aashish Katta)

2 Answers

0
votes

Two things I can think of: either you are running out of memory and need to increase your heap size, or you are running out of disk space on your data nodes. Check your log partitions and make sure they are not filled up.
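A rough way to check both, assuming a standard DSE/Hadoop 1.x package layout (the config path below may differ on your install):

    # free disk space on every node; watch the partitions holding Hadoop/Cassandra data and logs
    df -h

    # heap given to map/reduce child tasks; raise -Xmx here if the reducer is running out of memory
    grep -A1 "mapred.child.java.opts" /etc/dse/hadoop/mapred-site.xml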

0
votes

The only advice guaranteed to help is to look carefully inside the reducer logs (you should have them on the cluster; the JobTracker web UI is probably the best place to start). There are too many possible reasons to list.
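For example, for the job in the question you could start from the tracking URL Hive printed and drill down to the stuck reduce attempt; something like the following, reusing the same dse wrapper and job id shown in the output above:

    # job status and counters from the command line
    ./bin/dse hadoop job -status job_201305210728_0005

    # then, in a browser, open the tracking URL and follow:
    #   reduce tasks -> the stuck attempt -> "Task Logs" -> "All"
    # http://something:50030/jobdetails.jsp?jobid=job_201305210728_0005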