11
votes

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.

Can I avoid having all my text data stored twice?

EDIT: I load it via the following command

LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Then, I can find the exact same file in:

/user/hive/warehouse/sandbox.db/test/day=20130220

I assumed it was copied.

4
How can you tell it's copied? How do you load them into Hive tables? - Abimaran Kugathasan
I load it via LOAD DATA INPATH 'xxx' INTO TABLE yyy (see post edit), then I find the file in /user/hive/warehouse. I am wondering if Hive can leave it where it is (I guess I would have to enforce the partition structure in my directories, but that is fine). - Mad Echet
How was it stored in HDFS? - Abimaran Kugathasan
It is a CSV text file. It was put there via a Java application. - Mad Echet
So how can you tell it's the HDFS directory where your file is stored? Can you check where the hive.metastore.warehouse.dir property points in your Hive configuration? - Abimaran Kugathasan

4 Answers

15
votes

Use an external table:

CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/user/logs/';

If you want to use partitioning with an external table, you will be responsible for managing the partition directories. The location specified must be an HDFS directory.

If you drop an external table, Hive WILL NOT delete the source data. If you want to manage your raw files yourself, use external tables. If you want Hive to do it, then let Hive store them inside its warehouse path.
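As a sketch of what "managing the partition directories" looks like, assuming the raw logs for each day already sit in per-day HDFS directories such as /user/logs/day=20130221/ (this directory layout is an assumption, not something stated in the question), you can attach each directory as a partition without copying anything:

```sql
-- Assumes an external, partitioned variant of the table from the question;
-- the directory layout under /user/logs/ is hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS sandbox.test (id BIGINT, name STRING)
  PARTITIONED BY (day STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

-- Point the partition at the existing data; the files are neither
-- moved nor copied into the warehouse path.
ALTER TABLE sandbox.test ADD IF NOT EXISTS PARTITION (day='20130221')
  LOCATION '/user/logs/day=20130221/';
```

Each new day's directory then needs its own ADD PARTITION statement, which is the bookkeeping the answer refers to.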

4
votes

Instead of copying data to HDFS directly from your Java application, keep those files in the local file system and import them into HDFS via Hive using the following command.

LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Notice the LOCAL keyword: it tells Hive to copy the file from the local file system into the table's HDFS location, rather than move an existing HDFS file.

1
votes

You can use an ALTER TABLE partition statement to avoid data duplication.

CREATE EXTERNAL TABLE IF NOT EXISTS TestTable (testcol STRING) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER TABLE TestTable ADD PARTITION (year=2014, month=2, day=17) LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
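Once the partition is attached, queries read the files in place. A minimal usage sketch, using the table and path from the statements above:

```sql
-- Hive prunes to the attached partition directory; the data is never
-- duplicated into the warehouse path.
SELECT testcol
FROM TestTable
WHERE year = 2014 AND month = 2 AND day = 17;
```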
0
votes

Hive (at least when running in true cluster mode) cannot refer to files in the local file system of an arbitrary node. Hive can automatically import the files during table creation or a load operation. The reason is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS and writes back to HDFS, and runs in distributed mode, so a file stored on one machine's local file system cannot be used by the distributed infrastructure.