I am trying to load data from gzip archives into a Hive table, but my gzip files have extensions like, for example:
apache_log.gz_localhost
When I specify the HDFS directory where these files are located, Hive doesn't recognize them as gzip-compressed files, because it only looks for files with a .gz extension.
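As far as I can tell, Hadoop picks the decompression codec from the file name suffix. This is just a small sketch I used to check that assumption with CompressionCodecFactory (the paths are my own examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        // The factory maps a file name suffix to a registered codec.
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());

        CompressionCodec byGz = factory.getCodec(new Path("/tmp/logs/apache_log.gz"));
        CompressionCodec byOther = factory.getCodec(new Path("/tmp/logs/apache_log.gz_localhost"));

        System.out.println(byGz);    // a GzipCodec instance
        System.out.println(byOther); // null -> the file would be read as plain text
    }
}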
Is it possible to define the file type when loading data into Hive? Something like this (pseudo-code):
set input.format=gzip;
LOAD DATA INPATH '/tmp/logs/' INTO TABLE access_logs;
Here is my SQL for table creation:
CREATE EXTERNAL TABLE access_logs (
  `ip` STRING,
  `time_local` STRING,
  `method` STRING,
  `request_uri` STRING,
  `protocol` STRING,
  `status` STRING,
  `bytes_sent` STRING,
  `referer` STRING,
  `useragent` STRING,
  `bytes_received` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+|\-) "([^"]+)" "([^"]+)".* (\\d+)'
)
STORED AS TEXTFILE
LOCATION '/tmp/logs/';
I have also tried adding
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
at the beginning of the script.
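If there is no way to tell Hive the compression type directly, the only fallback I can think of is renaming the files in HDFS so that the .gz suffix comes last. A rough sketch using the Hadoop FileSystem API (the suffix rewriting is only an illustration of my naming scheme):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FixSuffix {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Rename e.g. "apache_log.gz_localhost" to "apache_log_localhost.gz"
        // so the codec factory can detect the gzip compression by suffix.
        for (FileStatus status : fs.listStatus(new Path("/tmp/logs/"))) {
            String name = status.getPath().getName();
            if (name.contains(".gz_")) {
                String fixed = name.replace(".gz_", "_") + ".gz";
                fs.rename(status.getPath(), new Path(status.getPath().getParent(), fixed));
            }
        }
    }
}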