I have questions regarding the "insert into" and "insert overwrite" commands in Hive.
I have a CSV file in HDFS. I want to extract values from few columns into a hive table. I load the full content into a HIVE table with one STRING column. I create the table:
CREATE TABLE temp_drivers (col_value STRING) STORED AS TEXTFILE
I load the data:
LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE temp_drivers
Each line of my CSV file is a line in the col_value column.
I create a second table:
CREATE TABLE drivers (driverId INT, name STRING, ssn BIGINT, location
STRING, certified STRING, wageplan STRING)
Then I insert some data from the first table into the second table by calling regexp_extract, let's say:
insert overwrite table drivers
SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) driverId,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) name,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) ssn,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) location,
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) certified,
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) wageplan
from temp_drivers
The commands are available in this tutorial on Hortonworks: https://hortonworks.com/tutorial/how-to-process-data-with-apache-hive/
Two questions:
Does this second command write new data in HDFS to backup the second table. If yes, what would be the alternative to prevent HIVE form creating new data?
Can I do insert into instead of insert overwrite to append new data everytimeI receive a new CSV file?