0
votes

I am new to Hadoop and Hive world.

I have written a Hive query that processes 189 million rows (a 40 GB file). While the query is running, it executes on a single machine, even though it generates many map and reduce tasks. Is that expected behavior?

I have read in many articles that Hadoop is a distributed processing framework. My understanding is that Hadoop splits a job into multiple tasks, distributes those tasks across different nodes, and once the tasks finish, the reducers join the output. Please correct me if I am wrong.

I have 1 master and 2 slave nodes, and I am using Hadoop 2.2.0 and Hive 0.12.0.


3 Answers

0
votes

If you have 2 slave nodes, Hive will split its workload across the two, provided your cluster is properly configured.

That being said, if your input file is not splittable (for example, it is a GZIP-compressed file), Hadoop will not be able to split and parallelize the work, and you will be stuck with a single input split and thus a single mapper, limiting the workload to a single machine.
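To see why a GZIP file cannot be split, here is a small toy sketch in Python (my own illustration, not Hadoop code): a plain-text file can be read starting from any byte offset, so Hadoop can hand each mapper its own slice; a gzip stream, by contrast, can only be decompressed from the beginning.

```python
import gzip
import zlib

# Some plain-text data, analogous to an uncompressed HDFS file.
data = b"some line of input\n" * 1000

# Plain text: a "split" starting mid-file is still perfectly readable,
# so Hadoop can give each mapper its own byte range.
half = data[len(data) // 2:]
assert half.decode().endswith("some line of input\n")

# GZIP: decompression must start at the stream header. Starting from the
# middle of the compressed bytes fails, so only one mapper can read the file.
compressed = gzip.compress(data)
try:
    zlib.decompress(compressed[len(compressed) // 2:], 16 + zlib.MAX_WBITS)
    splittable = True
except zlib.error:
    splittable = False

print(splittable)  # False: a gzip stream is not splittable
```

In practice this is why people either leave large inputs uncompressed, or use a splittable format (e.g. bzip2, or a container format like SequenceFile/ORC with block compression) when they want parallelism.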

0
votes

Your understanding of Hive is correct: Hive translates your query into a Hadoop job, which in turn is split into multiple tasks and distributed to the nodes, following map > sort & shuffle > reduce/aggregate > return to the Hive CLI.

0
votes

Thank you all for your quick reply.

You are all correct: my job is converted into different tasks and distributed to the nodes.

While checking the Hadoop Web UI, the first level showed the job running on a single node. When I drilled down further, it showed the mappers and reducers and where they are running.

Thanks :)