
My target is to perform a SELECT query using Hive

When I have small data on a single machine (the namenode), I start by:

1. Creating a table to hold the data:

   create table table1 (col1 int, col2 string);

2. Loading the data from a file path:

   load data local inpath 'path' into table table1;

3. Running my SELECT query:

   select * from table1 where col1 > 0;
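(A quick sanity check after step 2, assuming the same table name, is to count the rows Hive sees:)

```sql
-- Should return the number of rows in the loaded file.
SELECT COUNT(*) FROM table1;
```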

Now I have huge data, 10 million rows, that doesn't fit on a single machine. Let's assume Hadoop divided my data across, for example, 10 datanodes, with each datanode holding 1 million rows.

Retrieving all the data to a single computer is impossible due to its size, or would take a lot of time even if it were possible.

Will Hive create a table at each datanode and run the SELECT query there, or will Hive move all the data to one location (datanode) and create one table there (which would be inefficient)?


2 Answers


Ok, so I will walk through what happens when you load data into Hive.

The 10-million-line file will be cut into 64 MB/128 MB blocks. Hadoop, not Hive, distributes those blocks to the different slave nodes in the cluster, and each block is replicated several times (the default is 3).

Each slave node will hold some of the blocks that make up the original file, but no single machine needs to hold every block. Since Hadoop replicates the blocks, the cluster must have at least enough free space to accommodate 3x the file size.
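To get a feel for the numbers, here is a back-of-envelope sketch. The 100-byte average row size is an assumption for illustration; the 128 MB block size and replication factor of 3 are the HDFS defaults mentioned above:

```python
# Rough sketch (assumed numbers): how a 10-million-row file is split
# into HDFS blocks and how much cluster space replication needs.

ROWS = 10_000_000
ROW_BYTES = 100                    # assumed average row size
BLOCK_BYTES = 128 * 1024 * 1024    # default HDFS block size (128 MB)
REPLICATION = 3                    # default HDFS replication factor

file_bytes = ROWS * ROW_BYTES                # 1,000,000,000 bytes (~1 GB)
blocks = -(-file_bytes // BLOCK_BYTES)       # ceiling division -> 8 blocks
cluster_bytes = file_bytes * REPLICATION     # 3,000,000,000 bytes on disk

print(blocks)         # 8
print(cluster_bytes)  # 3000000000
```

So a file this size is only a handful of blocks, spread over (at most) that many datanodes, with three copies of each block.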

Once the data is in the cluster, Hive projects the table onto it. The query then runs on the machines Hadoop chooses to process the blocks that make up the file.

10 million rows isn't that large, though. Unless the table has 100 columns, you should be fine either way. However, if you do a select * in your query, remember that all of that data has to be sent back to the machine that ran the query, which can take a long time depending on file size.
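One way to limit what gets shipped back to the client, assuming the table from the question, is to select only the columns you need and cap the result set:

```sql
-- Project only the needed column and sample a few rows,
-- so little data travels back to the querying machine.
SELECT col2
FROM table1
WHERE col1 > 0
LIMIT 100;
```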

I hope I covered your question. If not please let me know and I'll try to help further.


The query

select * from table1 where col1>0

is a map-only job, so each data block is processed locally on the node that stores it. There is no need to collect the data centrally.
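You can check this yourself with Hive's EXPLAIN statement; for a pure filter query like this, the plan it prints contains no reduce stage:

```sql
-- A filter-only query compiles to a map-only plan.
EXPLAIN SELECT * FROM table1 WHERE col1 > 0;
```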