I am fairly new to Hadoop (HDFS and HBase) and the Hadoop ecosystem (Hive, Pig, Impala, etc.). I have a good understanding of Hadoop components such as the NameNode, DataNodes, JobTracker and TaskTrackers, and how they work in tandem to store data efficiently.
While trying to understand the fundamentals of a data access layer such as Hive, I need to understand where exactly a table's data gets stored once the table is created in Hive. We can create both external and internal (managed) tables in Hive. Since external tables can point to data in HDFS or any other file system, Hive doesn't store the data for such tables in its warehouse directory. What about internal tables? As I understand it, an internal table is created as a directory in HDFS under the Hive warehouse directory, with its file blocks distributed across the DataNodes. Once we load data into such a table from the local or HDFS file system, are further files created to store that data?
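To make the distinction concrete, here is roughly what I mean by the two kinds of tables (the table names, columns, and paths below are placeholders I made up, not anything from my actual setup):

```sql
-- Internal (managed) table: Hive owns the data, which ends up under
-- the warehouse directory (e.g. <warehouse_dir>/managed_example).
CREATE TABLE managed_example (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- External table: Hive stores only the schema; the data stays at the
-- LOCATION I point it to, and DROP TABLE leaves those files untouched.
CREATE EXTERNAL TABLE external_example (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/users/big_data/external/example_data';
```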
Say, for example:
- A sample file named test_emp_feedback.csv was copied from the local file system to HDFS.
- A table (emp_feedback) was created in Hive with a structure matching that of the CSV file. This led to the creation of a directory in the Hadoop cluster, say /users/big_data/hive/emp_feedback.
- Now I load data into the emp_feedback table from test_emp_feedback.csv (roughly as in the sketch after this list).
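Concretely, this is the sequence of commands I mean (the column names are guesses on my part, since all I know about the file is that it holds employee feedback; the local path is also just illustrative):

```sql
-- Step 1: copy the sample file from the local file system into HDFS
-- (run from the Hive CLI, where dfs commands also work):
dfs -put /tmp/test_emp_feedback.csv /users/big_data/;

-- Step 2: create the table with a structure similar to the CSV:
CREATE TABLE emp_feedback (
  emp_id   INT,
  feedback STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Step 3: load the data from the HDFS copy of the file:
LOAD DATA INPATH '/users/big_data/test_emp_feedback.csv'
INTO TABLE emp_feedback;
```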
Is Hive going to create a copy of the file in the emp_feedback directory? Won't that cause data redundancy?
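(I assume one way to check is to list both locations before and after the load, e.g.:

```sql
-- Compare the original HDFS location with the table's directory
-- to see where the file ends up after the LOAD:
dfs -ls /users/big_data/;
dfs -ls /users/big_data/hive/emp_feedback/;
```

but I'd still like to understand the intended design, not just the observed behavior.)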