0 votes

I have a conceptual doubt about Hive. I know that Hive is a data warehouse tool that runs on top of Hadoop, and that Hadoop has a distributed file system, HDFS.

Suppose I have one master and three slaves, and I have created a table employees in HiveQL. The table is so huge that it can't be stored on one machine, so it must be spread across all four machines. How can I load such data? Does it have to be done manually, or can I just type "LOAD DATA ..." on the master and have it automatically distributed among all the machines?

1
It will be distributed automatically across the DataNodes; the NameNode only holds its metadata. - Dhruv Kapatel

1 Answer

0 votes

Hive uses HDFS as its warehouse to store the data, so it is HDFS that actually handles the storage and distribution.
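
So to answer your question: a single LOAD DATA statement issued from the Hive client is enough; you do not place anything on individual machines yourself. A minimal sketch follows, where the column layout and the file path are just placeholders for illustration:

    -- Hypothetical table layout; adjust columns and the file path to your data.
    CREATE TABLE employees (
      id     INT,
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

    -- One statement run from the Hive client on the master: the file is copied
    -- into the table's warehouse directory on HDFS, and HDFS splits it into
    -- blocks and replicates them across the DataNodes automatically.
    LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;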

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
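
If you want to see this block distribution for yourself, you can inspect the table's files on HDFS. The commands below are a sketch; the warehouse path shown is the common default and is controlled by hive.metastore.warehouse.dir, so yours may differ:

    # List the table's files under the Hive warehouse directory.
    hdfs dfs -ls /user/hive/warehouse/employees

    # Show how each file is split into blocks and which DataNodes
    # hold the replicas - no manual placement is involved.
    hdfs fsck /user/hive/warehouse/employees -files -blocks -locations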

Please refer to the HDFS architecture documentation for more detail.