0
votes

I am new to Hadoop and Hive world.

I have written a Hive query that processes 189 million rows (a 40 GB file). While the query is running, it executes on a single machine, even though it generates many map and reduce tasks. Is that expected behavior?

I have read in many articles that Hadoop is a distributed processing framework. My understanding is that Hadoop splits a job into multiple tasks, distributes those tasks across different nodes, and once the tasks finish, the reducers join the output. Please correct me if I am wrong.

I have 1 master and 2 slave nodes, and I am using Hadoop 2.2.0 and Hive 0.12.0.


3 Answers

0
votes

If you have 2 slave nodes, Hive will split its workload across the two, provided your cluster is properly configured.

That being said, if your input file is not splittable (for example, it is a GZIP-compressed file), Hadoop will not be able to split and parallelize the work, and you will be stuck with a single input split and thus a single mapper, limiting the workload to a single machine.
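To see why a GZIP file cannot be split, here is a small toy sketch in Python (my own illustration, not Hadoop code): a plain-text file can be read starting from any byte offset, so Hadoop can hand each mapper its own slice; a gzip stream, by contrast, can only be decompressed from the beginning.

```python
import gzip
import zlib

# Some plain-text data, analogous to an uncompressed HDFS file.
data = b"some line of input\n" * 1000

# Plain text: a "split" starting mid-file is still perfectly readable,
# so Hadoop can give each mapper its own byte range.
half = data[len(data) // 2:]
assert half.decode().endswith("some line of input\n")

# GZIP: decompression must start at the stream header. Starting from the
# middle of the compressed bytes fails, so only one mapper can read the file.
compressed = gzip.compress(data)
try:
    zlib.decompress(compressed[len(compressed) // 2:], 16 + zlib.MAX_WBITS)
    splittable = True
except zlib.error:
    splittable = False

print(splittable)  # False: a gzip stream is not splittable
```

In practice this is why people either leave large inputs uncompressed, or use a splittable format (e.g. bzip2, or a container format like SequenceFile/ORC with block compression) when they want parallelism.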

0
votes

Your understanding of Hive is correct: Hive translates your query into a Hadoop job, which in turn is split into multiple tasks and distributed to the nodes, following map > sort & shuffle > reduce/aggregate > return to the Hive CLI.

0
votes

Thank you all for your quick reply.

You are all correct: my job is converted into different tasks and distributed to the nodes.

While checking the Hadoop Web UI, the first level showed the job running on a single node. When I drilled down further, it showed the mappers and reducers and where they are running.

Thanks :)