As I understand Hive, when we load a dataset into a Hive table, the data file is moved from the source path to the Hive warehouse directory within HDFS, and HDFS is configured to keep three replicas of the data.

These questions might look silly, but as I am a beginner, I want to clear my doubts.

My questions are:

1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?

2) When we run a query on a Hive table, is the query processed in a distributed manner? For example, say one data file is 1 GB (in turn, 8 blocks x 128 MB). With a replication factor of three, there would be 24 blocks in total for this file. Will our Hive query be distributed among all of these blocks, or will it be processed on the Hive warehouse blocks only?

Thanks in advance.

1 Answer

If you run "load data inpath" on an HDFS path, the data is moved from the source to the destination HDFS path. If you run "load data local inpath", the data is not moved from the local file system; it is copied into HDFS instead.

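For your first question: if you delete a file in HDFS, all of its replicas are deleted. The Hive warehouse is just a directory in HDFS, so the file there is not separate from its replicas. Note that dropping a managed (internal) table deletes its data files, whereas dropping an external table removes only the metadata and leaves the files in place.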

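If you have a 1 GB file (8 blocks) with a replication factor of 3, then when you trigger the query in the Hive CLI, Hive converts your query to MapReduce (MR) jobs, which run distributed across the cluster. The job processes only the 8 blocks, reading one replica of each; the other replicas are not processed separately. If a datanode fails while the job is running, the affected block is read from another replica on a different node and the task is retried there. (This is ordinary replica failover and task retry, not speculative execution, which means launching a duplicate attempt of a slow-running task.)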
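The block arithmetic in this answer can be checked with a small Python sketch. It is only an illustration, not anything Hive or HDFS runs: `block_stats` is a hypothetical helper, and the "one map task per block" figure is an assumption that roughly holds for a plain, splittable file. The real number of tasks depends on the input format, compression, and split-combining settings.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS block size, taken from the question
REPLICATION = 3      # replication factor, taken from the question


def block_stats(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (logical_blocks, stored_replicas, approx_map_tasks) for one file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    logical_blocks = math.ceil(file_size_mb / block_size_mb)
    # Each logical block is stored `replication` times across datanodes.
    stored_replicas = logical_blocks * replication
    # Assumption: one input split (and so one map task) per logical block.
    # Replicas only add fault tolerance and data locality, not extra work.
    approx_map_tasks = logical_blocks
    return logical_blocks, stored_replicas, approx_map_tasks


blocks, replicas, tasks = block_stats(1024)  # a 1 GB file
print(blocks, replicas, tasks)  # 8 24 8
```

So the 24 stored block replicas still represent only 8 blocks' worth of data, and the query reads each of those 8 blocks once.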