0
votes

I have questions regarding the "insert into" and "insert overwrite" commands in Hive.

I have a CSV file in HDFS. I want to extract values from few columns into a hive table. I load the full content into a HIVE table with one STRING column. I create the table:

CREATE TABLE temp_drivers (col_value STRING) STORED AS TEXTFILE

I load the data:

LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE temp_drivers

Each line of my CSV file is a line in the col_value column.

I create a second table:

CREATE TABLE drivers (driverId INT, name STRING, ssn BIGINT, location 
STRING, certified STRING, wageplan STRING)

Then I insert some data from the first table into the second table by calling regexp_extract, let's say:

insert overwrite table drivers
SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) name,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ssn,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) location,
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) certified,
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) wageplan
from temp_drivers

The commands are available in this tutorial on Hortonworks: https://hortonworks.com/tutorial/how-to-process-data-with-apache-hive/

Two questions:

  • Does this second command write new data in HDFS to backup the second table. If yes, what would be the alternative to prevent HIVE form creating new data?

  • Can I do insert into instead of insert overwrite to append new data everytimeI receive a new CSV file?

1

1 Answers

0
votes

1) The second command :

LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE temp_drivers

will overwrite temp_drivers table and load the data which is present in drivers.csv

if you have multiple csv files which need to be loaded to temp_drivers table you can use below commands: LOAD DATA INPATH '/user/maria_dev/data1.csv' INTO TABLE temp_drivers; LOAD DATA INPATH '/user/maria_dev/data2.csv' INTO TABLE temp_drivers;

2) Yes you need to do Insert into to append the new data to the existing data