
I have configured the cluster with a map capacity of 4000 and each job with 500 maps. Based on my understanding of FIFO mode and the link Running jobs parallely in hadoop, if I submit 8 jobs, these 8 jobs should run in parallel, right? However, I still see the 8 jobs I submitted run sequentially, which seems strange. Another option would be the fair scheduler, but I ran into some other bugs with it... How can I make these jobs run in parallel?

I am the only user now.

Question: What does the job tracker web UI show for total running jobs?

Actually I have submitted around 80 jobs, and all of them were submitted successfully, since I can see 80 of them under the "Running Jobs" section, but they still run sequentially.

Question: How many input files are you currently processing, and how does this relate to the number of mappers for the job?

For each job I configure 500 maps through the mapred-site.xml setting map.task.num=500.
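For reference, the stock MRv1 property for requesting a map count is mapred.map.tasks, and Hadoop treats it only as a hint; assuming the map.task.num setting above wraps something equivalent, the fragment would look like:

```xml
<!-- mapred-site.xml: sketch using the standard MRv1 property; this is a
     hint to the framework, not a hard limit, unless the InputFormat
     enforces it in its split computation -->
<property>
  <name>mapred.map.tasks</name>
  <value>500</value>
</property>
```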

Below is the information from the job tracker:

Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
map      1.40%        500         402       91        7          0        0 / 0
reduce   0.00%        1           1         0         0          0        0 / 0

Question: You can configure your InputFormat to only run 500 maps, but there are occasions where Hadoop ignores this value: if you have more than 500 input files, for example.

I am sure this will not happen, since I customized the InputFormat so that the number of mappers run is exactly the number I configure in mapred-site.xml.

Question: When you start your job, how many files are you running over, what InputFormat are you using, and what file compression, if any, are you using on the input files?

OK, I actually run over only one file, but this file must be fully loaded by every map task, so I use the DistributedCache mechanism to let each map task load the file in full. I am not using compression currently.

Question: What does the job tracker show for the total number of configured mapper and reducer slots? Does this match up with your expected value of 5000?

Below is the information:

Maps   Reduces   Total Submissions   Nodes   Map Task Capacity   Reduce Task Capacity   Avg. Tasks/Node   Blacklisted Nodes
83     0         80                  8       4000                80                     510.00            0

Can you confirm what scheduler you are using? (Open up a running or completed job and examine the job.xml configured property mapred.jobtracker.taskScheduler.) - Chris White
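For reference, the property mentioned in the comment appears in job.xml as a standard Hadoop (MRv1) property; a sketch of what the FIFO default would look like (the actual value on this cluster is unknown):

```xml
<!-- job.xml: what the entry looks like when the default FIFO scheduler
     is in use; the value shown is an assumption about this cluster -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
</property>
```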

1 Answer


Whether you run the FairScheduler or the CapacityScheduler, you should still be able to run jobs in parallel, but there are some reasons why you may see your jobs run sequentially:

  • Are you the only person using the cluster? If not, how many other people are using it?
    • Question: What does the job tracker web UI show for total running jobs?
  • If you are indeed the only one with jobs running on the cluster at a particular point in time, then check the Job Tracker web UI for your currently running job: how many input files are you currently processing, and how does this relate to the number of mappers for the job?
    • You can configure your InputFormat to only run 500 maps, but there are occasions where Hadoop ignores this value: if you have more than 500 input files, for example.
    • Question: When you start your job, how many files are you running over, what InputFormat are you using, and what file compression, if any, are you using on the input files?
  • Question: What does the job tracker show for the total number of configured mapper and reducer slots? Does this match up with your expected value of 5000?
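If the job.xml check shows the FIFO default scheduler, one option is to switch the JobTracker to the FairScheduler, which runs jobs from multiple pools concurrently. A minimal mapred-site.xml sketch for Hadoop 1.x (MRv1); the allocation file path is an assumption, and the JobTracker must be restarted for the change to take effect:

```xml
<!-- mapred-site.xml: minimal sketch for enabling the FairScheduler on
     Hadoop 1.x; adjust the allocation file path (an assumption here)
     to your installation -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/conf/fair-scheduler.xml</value>
</property>
```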