
I am trying to load data from gzip archives into a Hive table, but my gzip files have a non-standard extension, for example:

apache_log.gz_localhost

When I point Hive at the HDFS directory that contains these files, it does not recognize them as gzip-compressed, because it only looks for files with the .gz extension.

Is it possible to define the file type when loading data into Hive? Something like this (pseudocode):

set input.format=gzip;

LOAD DATA INPATH '/tmp/logs/' INTO TABLE access_logs;

Here is my SQL for table creation:

CREATE EXTERNAL TABLE access_logs (
`ip`                STRING,
`time_local`        STRING,
`method`            STRING,
`request_uri`       STRING,
`protocol`          STRING,
`status`            STRING,
`bytes_sent`        STRING,
`referer`           STRING,
`useragent`         STRING,
`bytes_received`    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='^(\\S+) \\S+ \\S+ \\[([^\\[]+)\\] "(\\w+) (\\S+) (\\S+)" (\\d+) (\\d+|\-) "([^"]+)" "([^"]+)".* (\\d+)'
)
STORED AS TEXTFILE
LOCATION '/tmp/logs/';
Add set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; at the beginning of the script. - Singleton
Unfortunately that does not help. Thank you. - antunovic
Interesting question, hope it gets more attention. - WestCoastProjects

1 Answer


Why not rename the files to xxx.gz after putting them in HDFS?
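As a rough illustration of the rename approach, here is a minimal sketch of how the new names could be derived. The class GzRename, the method toGzName, and the choice to keep the host part in the base name are all assumptions made for illustration; nothing here is part of Hadoop or Hive.

```java
// Hypothetical helper, not part of Hadoop or Hive: derives a ".gz" file name
// from a name such as "apache_log.gz_localhost".
final class GzRename {
    // Assumption: the non-standard suffix always has the form ".gz_<something>".
    private static final String MARKER = ".gz_";

    private GzRename() {}

    // Moves the trailing part in front of the extension so the name ends in ".gz".
    // Names that do not contain ".gz_" are returned unchanged.
    static String toGzName(String name) {
        int i = name.lastIndexOf(MARKER);
        if (i < 0) {
            return name;
        }
        String base = name.substring(0, i);
        String suffix = name.substring(i + MARKER.length());
        return base + "_" + suffix + ".gz";
    }

    public static void main(String[] args) {
        System.out.println(toGzName("apache_log.gz_localhost"));
    }
}
```

Each old/new pair could then be applied with hadoop fs -mv, or from Java with FileSystem.rename(Path, Path), before querying the table.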

If you really want to support the .gz_localhost extension, you can write your own codec by subclassing GzipCodec:

  1. Create your own NewGzipCodec class that extends GzipCodec:

    public class NewGzipCodec extends org.apache.hadoop.io.compress.GzipCodec { }

  2. Override the getDefaultExtension method inside that class:

    @Override
    public String getDefaultExtension() { return ".gz_localhost"; }

  3. Compile the class with javac (with the Hadoop jars on the classpath) and package NewGzipCodec.class into NewGzipCodec.jar.

  4. Copy NewGzipCodec.jar to ${HADOOP_HOME}/lib.

  5. Register the codec in your core-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>NewGzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
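To see why overriding getDefaultExtension should be enough, here is a simplified model of suffix-based codec lookup. As far as I understand, Hadoop picks a codec by checking which registered codec's default extension the file name ends with; the class CodecLookupSketch and its methods below are hypothetical stand-ins for that idea, not Hadoop's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model of extension-based codec selection.
// It only maps extensions to codec names; it does no real decompression.
final class CodecLookupSketch {
    private final Map<String, String> byExtension = new LinkedHashMap<>();

    // Registers a codec name under its default extension.
    void register(String extension, String codecName) {
        byExtension.put(extension, codecName);
    }

    // Returns the codec whose extension the file name ends with,
    // or null when nothing matches (the file is then read as plain text).
    String resolve(String fileName) {
        for (Map.Entry<String, String> entry : byExtension.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // Registry holding only the stock codecs listed in the core-site.xml value.
    static CodecLookupSketch stock() {
        CodecLookupSketch lookup = new CodecLookupSketch();
        lookup.register(".deflate", "DefaultCodec");
        lookup.register(".gz", "GzipCodec");
        lookup.register(".bz2", "BZip2Codec");
        return lookup;
    }

    public static void main(String[] args) {
        CodecLookupSketch lookup = stock();
        // No stock extension matches, so the file would be treated as uncompressed.
        System.out.println(lookup.resolve("apache_log.gz_localhost"));
        // After registering the custom extension, the same name resolves to the new codec.
        lookup.register(".gz_localhost", "NewGzipCodec");
        System.out.println(lookup.resolve("apache_log.gz_localhost"));
    }
}
```

In this model the file name never ends with .gz, so the stock GzipCodec is never selected; registering a codec whose extension is .gz_localhost closes that gap while leaving ordinary .gz files on the original codec.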