I am new to HDFS and Hive. I got an introduction to both after reading some books and documentation. I have a question about creating a table in Hive for a file that is already present in HDFS. The file has 300 fields, and I want to create a Hive table over it but use only, say, 30 of those fields. My questions are:
1. Does Hive create a separate file directory?
2. Do I have to create the Hive table first and import the data from HDFS?
3. Since I want to create a table with 30 columns out of the 300, does Hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns, import it into HDFS, and then create a Hive table pointing to that HDFS directory?
2 Answers
My questions are
- Does Hive create a separate file directory? Yes, if you create a Hive table (managed or external) and load the data using the LOAD command.
No, if you create an external table and point it to the existing file.
- Do I have to create the Hive table first and import data from HDFS?
Not necessarily; you can create a Hive external table and point it to this existing file (see the sketch at the end of this answer).
- Since I want to create a table with 30 columns out of 300 columns, does Hive create a file with only those 30 columns?
You can do it easily using HiveQL. Follow the steps below (note: this is not the only approach):
- Create an external table with the 300 columns and point it to the existing file.
- Create another Hive table with the desired 30 columns and insert data into this new table from the 300-column table using "INSERT INTO table30col SELECT ... FROM table300col". Note: Hive will create the file with only those 30 columns during this insert operation (a sketch follows at the end of this answer).
- Do I have to create a separate file with 30 columns, import it into HDFS and then create a Hive table pointing to that HDFS directory?
Yes, this can be an alternative. I personally like the solution mentioned in question 3, as I don't have to recreate the file and I can do all of that in Hadoop without depending on some other system.
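Here is a minimal HiveQL sketch of the approach from question 3. The table names, column names, delimiter and HDFS path are placeholders for illustration, and only a few of the 300/30 columns are written out:

-- Hypothetical external table that simply points at the existing HDFS file.
-- In practice all 300 columns must be declared; only three are shown here.
CREATE EXTERNAL TABLE table300col (
  col1 STRING,
  col2 STRING,
  col3 INT
  -- ... the remaining columns ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- match the delimiter of the existing file
LOCATION '/user/hive/data/source_dir';          -- HDFS directory that contains the file

-- Hypothetical managed table holding only the columns of interest.
CREATE TABLE table30col (
  col1 STRING,
  col2 STRING,
  col3 INT
  -- ... the rest of the 30 desired columns ...
);

-- Hive writes new files containing only these columns during this insert.
INSERT INTO TABLE table30col
SELECT col1, col2, col3   -- list the 30 desired columns here
FROM table300col;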
You have several options. One is to have Hive simply point to the existing file, i.e. create an external Hive table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This Hive table will, obviously, match your existing data exactly: you must declare all 300 columns. There is no data duplication; there is only one file, and Hive simply references the already existing file.
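If you want to confirm that Hive is only referencing the existing data rather than copying it, you can inspect the table's metadata (the table name below is a placeholder):

-- Shows Table Type (EXTERNAL_TABLE) and the Location pointing at the existing HDFS path.
DESCRIBE FORMATTED my_external_table;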
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout, and storage as your original file.
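As a small sketch of the LOAD option (the table name and HDFS path are placeholders): LOAD DATA INPATH moves the file from its current HDFS location into the location Hive manages for the table, without changing its content.

-- The table must already exist; the file is brought in as-is, no transformation is applied.
-- (LOAD DATA LOCAL INPATH would copy from the local filesystem instead.)
LOAD DATA INPATH '/user/you/existing_file' INTO TABLE my_staging_table;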
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, either using a tool like Sqoop or by going through an intermediate staging table created with one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external staging table, insert the data into the target with INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. keep only the 30 columns you need) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use it because it will give you a tremendous query performance boost.
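A minimal sketch of that recommended flow, assuming a 300-column external staging table like the one in the first answer already exists; the table and column names are placeholders and only a few columns are written out:

-- Target table with only the needed columns, stored as ORC for query performance.
CREATE TABLE target30 (
  col1 STRING,
  col2 STRING,
  col3 INT
  -- ... the rest of the 30 desired columns ...
)
STORED AS ORC;

-- Hive rewrites the selected data in ORC format during this insert.
INSERT INTO TABLE target30
SELECT col1, col2, col3   -- the 30 columns you need
FROM table300col;

-- Dropping an EXTERNAL staging table removes only its metadata, not the underlying file.
DROP TABLE table300col;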