
I have data in HDFS in .tsv format that I need to load into a Hive table. I need some help.

The data in HDFS looks like:

/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03

Note: data is loaded into the HDFS directory /ad_data/raw/reg_logs daily and hourly.

There are 3 .tsv files in this HDFS directory:

funel1.tsv
funel2.tsv
funel3.tsv

Each .tsv file has 3 tab-separated columns and contains data like:

2344    -39 223
2344    -23 443
2394    -43 98
2377    -12 33
...
...

I want to create a Hive schema with 3 columns: id int, region_code int, and count int, exactly as in HDFS. If possible, I would also like to strip the negative sign in the Hive table, but that is not a big deal.
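If it helps clarify what I mean: since an external table reads the raw files as-is, I assume the sign could only be dropped at query time (or while copying into another table), not at load time. A minimal sketch using Hive's built-in abs() against the reg_logs table defined below:

-- hypothetical query; `count` is backquoted only because it is also a function name
SELECT id, abs(region_code) AS region_code, `count`
FROM reg_logs;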

I created a Hive table with this schema (please correct me if I am wrong):

CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';

All I want to do is copy data from HDFS into Hive. I do not want to use "load data inpath '..' into table reg_logs" because I do not want to load the data manually every day. I just want to point the Hive table at the HDFS directory so that it picks up each day's data automatically.

How can I achieve this? Please correct my Hive table schema if needed, and suggest a way to get the data in there.

==

2nd part:

I want to create another table, reg_logs_org, which is populated from reg_logs. I need to put everything from reg_logs into reg_logs_org except the hour column.

Schema I created is:

CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';

Insert data into reg_logs_org from reg_logs:

insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from 
reg_logs
group by 
utc_date, id, region_code

error message:

FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'

==

Thank you,
Rio

1 Answer


You're very close. The last step is to add the partition information to Hive's metastore: Hive stores the location of every partition individually, and it does not automatically discover new partitions. There are two ways to add them:

  1. Every hour, do an add partition statement:

    alter table reg_logs add partition(utc_date='2014-06-11', utc_hour='03')
    location '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';
    
  2. Every hour (or less frequently), do a table repair. This scans the root table location for any partitions that have not yet been added.

    msck repair table reg_logs;
    

The first approach is a bit more painful, but more efficient. The second approach is easy, but does a full scan of all partitions every time.
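If you go with the first approach, the statement is easy to parameterize so whatever scheduler you use (cron, Oozie, etc.) can run it each hour. A rough sketch, where dt and hr are hypothetical variables passed in by the scheduler:

-- add_partition.hql (hypothetical file name); dt and hr are supplied by the scheduler
ALTER TABLE reg_logs ADD IF NOT EXISTS
  PARTITION (utc_date='${hivevar:dt}', utc_hour='${hivevar:hr}')
  LOCATION '/ad_data/raw/reg_logs/utc_date=${hivevar:dt}/utc_hour=${hivevar:hr}';

This could be invoked as something like hive -hivevar dt=2014-06-11 -hivevar hr=03 -f add_partition.hql; the IF NOT EXISTS makes it safe to re-run.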

Edit: second half of question:

You just need to add some syntax for inserting into a table using dynamic partitions. In general, it is:

insert overwrite table [table name] partition([partition column])
select ...

Or in your case:

insert overwrite table reg_logs_org partition(utc_date)
select id, region_code, sum(count), utc_date
from 
reg_logs
group by 
utc_date, id, region_code
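
Depending on your Hive configuration, you may also need to enable non-strict dynamic partitioning before running that insert, since every partition column here is dynamic:

-- allow dynamic partitions, and allow all partition columns to be dynamic
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;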