In my work, I import Avro files into Impala tables by copying the files into HDFS and then executing "refresh" on the table in Impala.
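For uncompressed files that works fine; the workflow is roughly this (the file and table names here are just placeholders, not my real ones):

hdfs dfs -put log.2016-03-07.avro /user/hive/warehouse/adhoc_data_fast.db/log/2016-03-07
impala-shell -q "refresh log"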
But when I try to do the same with compressed files, it doesn't work.
The only documentation I've found about enabling compression for Avro tables is this link: http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_avro.html#avro_compression_unique_1 .
Here is what I do:
Enable Hive output compression in the Hive shell:
hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=bzip2;
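As far as I understand, these settings only take effect when Hive itself writes the Avro files, for example via an INSERT from another table (log_staging below is just a made-up name for an uncompressed source table, not something I actually have):

hive> INSERT OVERWRITE TABLE log_bzip2 PARTITION (day='2016-03-07')
      SELECT `timestamp`, appid, clientid, statkey, expid, modid, value, summary, custom
      FROM log_staging;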
Create a table:
CREATE TABLE log_bzip2 (
  `timestamp` bigint COMMENT 'from deserializer',
  appid string COMMENT 'from deserializer',
  clientid string COMMENT 'from deserializer',
  statkey string COMMENT 'from deserializer',
  expid string COMMENT 'from deserializer',
  modid string COMMENT 'from deserializer',
  value double COMMENT 'from deserializer',
  summary string COMMENT 'from deserializer',
  custom string COMMENT 'from deserializer')
PARTITIONED BY (day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs://szq2.appadhoc.com:8020/user/hive/log.avsc');

Load the compressed Avro file into HDFS:
hdfs dfs -put log.2016-03-07.1457184357726.avro.bz2 /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07
Add the partition and refresh the table in the Impala shell:
alter table log_bzip2 add partition (day="2016-03-07") location '/user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/';
refresh log_bzip2;
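In case it matters, this is how I verify that the partition is registered (just a sanity check):

show partitions log_bzip2;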
Query it, but it doesn't work:
select * from log_bzip2 limit 10;
Query: select * from log_bzip2 limit 10
WARNINGS: Invalid AVRO_VERSION_HEADER: '42 5a 68 39 '
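For what it's worth, the first four bytes of the file match the bzip2 magic number rather than Avro's. Checking from the shell (same partition directory as above):

hdfs dfs -cat /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/* | head -c 4 | od -An -tx1

prints 42 5a 68 39 ("BZh9", the bzip2 header), whereas an uncompressed Avro container file starts with 4f 62 6a 01 ("Obj" followed by the byte 0x01).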
How can I do it right? Thanks!