2
votes

I tried creating a table in Hive and wanted to export it in Avro format.

Eventually I want to load this Avro file into Google BigQuery. For some reason, after the export the Avro schema does not have the correct column names.

create table if not exists test (id int, name varchar(40));
insert into test values (1, "AK");
insert overwrite directory "/tmp/test" stored as avro select * from test;
!sh hadoop fs -cat /tmp/test/*;

The output should have the column names id and name, but they are translated to _col0 and _col1.

Objavro.schema▒{"type":"record","name":"baseRecord","fields":[{"name":"_col0","type":["null","int"],"default":null},{"name":"_col1","type":["null",{"type":"string","logicalType":"varchar","maxLength":40}],"default":null}]}▒Bh▒▒δ*@▒x~ AK▒Bh▒▒δ*@▒x~
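
(For reference, the embedded schema can also be dumped on its own with the avro-tools getschema command; the jar version and the 000000_0 part-file name below are illustrative and will vary by setup.)

!sh hadoop jar avro-tools-1.8.2.jar getschema /tmp/test/000000_0;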

Thanks,

AK

2
Also, it is the same output for tables stored as Avro! - AKS

2 Answers

1
vote

If an Avro binary file needs to be exported as a single file for further ingestion (in my context, into BigQuery), don't use hadoop fs -cat or insert overwrite statements. Use avro-tools and its concat command to merge the part files into one big Avro file.

hadoop jar avro-tools-1.8.2.jar concat /tmp/test_avro/* big_avro_table.avro
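
Once the files are merged, a minimal follow-up for the BigQuery part might look like this (mydataset.test is an assumed dataset/table name, and big_avro_table.avro is assumed to land in your HDFS home directory):

# copy the merged file out of HDFS, then load it into BigQuery
hadoop fs -get big_avro_table.avro .
bq load --source_format=AVRO mydataset.test big_avro_table.avro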

0
votes

This appears to be the intended behaviour when exporting with the insert overwrite directory clause. This older thread discusses the same issue; it is dated, but I believe the conclusion still holds (at least I could not find a direct way to keep the column names). It does include a couple of hacks to work around the problem, so it may be worth reading through.
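
One workaround along those lines (a sketch only, not necessarily one of the hacks from the linked thread; the table name test_avro and the default warehouse path are placeholders) is to write into a named table stored as Avro instead of a directory, since the Avro schema Hive generates for a real table keeps the declared column names, and then export that table's files:

create table if not exists test_avro (id int, name varchar(40)) stored as avro;
insert into test_avro select id, name from test;
-- the files under the table's location should now embed "id" and "name"
!sh hadoop fs -ls /user/hive/warehouse/test_avro/;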