11
votes

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.

Can I avoid having all my text data stored twice?

EDIT: I load it via the following command

LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Then, I can find the exact same file in:

/user/hive/warehouse/sandbox.db/test/day=20130220

I assumed it was copied.

4
How can you tell it's copied? How do you load them into Hive tables? - Abimaran Kugathasan
I load it via LOAD DATA INPATH 'xxx' INTO TABLE yyy (see post edit), then I find the file in /user/hive/warehouse. I am wondering if Hive can leave it where it is (I guess I would have to enforce the partition structure in my directories, but that is fine). - Mad Echet
How was it stored in HDFS? - Abimaran Kugathasan
It is a CSV text file. It was put there via a Java application. - Mad Echet
So how can you tell it's the HDFS directory where your file is stored? Can you check where the hive.metastore.warehouse.dir property points in your Hive configuration? - Abimaran Kugathasan

4 Answers

15
votes

Use an external table:

CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/user/logs/';

If you want to use partitioning with an external table, you will be responsible for managing the partition directories. The location specified must be an HDFS directory.

If you drop an external table, Hive WILL NOT delete the source data. If you want to manage your raw files yourself, use external tables. If you want Hive to do it, then let Hive store them inside its warehouse path.
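As a sketch of what "managing the partition directories" looks like, assuming the raw logs for each day already sit in per-day HDFS directories such as /user/logs/day=20130221/ (this directory layout is an assumption, not something stated in the question), you can attach each directory as a partition without copying anything:

```sql
-- Assumes an external, partitioned variant of the table from the question;
-- the directory layout under /user/logs/ is hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS sandbox.test (id BIGINT, name STRING)
  PARTITIONED BY (day STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

-- Point the partition at the existing data; the files are neither
-- moved nor copied into the warehouse path.
ALTER TABLE sandbox.test ADD IF NOT EXISTS PARTITION (day='20130221')
  LOCATION '/user/logs/day=20130221/';
```

Each new day's directory then needs its own ADD PARTITION statement, which is the bookkeeping the answer refers to.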

4
votes

Instead of copying data to HDFS directly from your Java application, keep those files in the local file system and import them into HDFS via Hive using the following command.

LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Notice the LOCAL keyword: it tells Hive to copy the file from the local file system into the table's HDFS location, rather than move an existing HDFS file.

1
votes

You can use an ALTER TABLE partition statement to avoid data duplication.

CREATE EXTERNAL TABLE IF NOT EXISTS TestTable (testcol STRING) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER TABLE TestTable ADD PARTITION (year=2014, month=2, day=17) LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
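Once the partition is attached, queries read the files in place. A minimal usage sketch, using the table and path from the statements above:

```sql
-- Hive prunes to the attached partition directory; the data is never
-- duplicated into the warehouse path.
SELECT testcol
FROM TestTable
WHERE year = 2014 AND month = 2 AND day = 17;
```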
0
votes

Hive (at least when running in true cluster mode) cannot refer to files in the local file system of an arbitrary node. Hive can automatically import the files during table creation or a load operation. The reason is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS and writes back to HDFS, and runs in distributed mode, so a file stored on one machine's local file system cannot be used by the distributed infrastructure.