
We have a JSON structure that we need to parse and use in Impala/Hive. Since the JSON structure is evolving, we thought we could use Avro.

We plan to parse the JSON and format it as Avro.

The Avro-formatted data can be used directly by Impala. Let's say we store it in the HDFS directory /user/hdfs/person_data/.

We will keep putting Avro-serialized data into that folder as we parse the input JSON documents one by one.

Let's say we have an Avro schema file for a person (hdfs://user/hdfs/avro/schemas/person.avsc) like:

{
 "type": "record",
 "namespace": "avro",
 "name": "PersonInfo",
 "fields": [
   { "name": "first", "type": "string" },
   { "name": "last", "type": "string" },
   { "name": "age", "type": "int" }
 ]
}

For this we will create an external table in Hive:

CREATE EXTERNAL TABLE kst
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/hdfs/person_data/'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://user/hdfs/avro/schemas/person.avsc');

Let's say tomorrow we need to change this schema (hdfs://user/hdfs/avro/schemas/person.avsc) to:

{
 "type": "record",
 "namespace": "avro",
 "name": "PersonInfo",
 "fields": [
   { "name": "first", "type": "string" },
   { "name": "last", "type": "string" },
   { "name": "age", "type": "int" },
   { "name": "city", "type": "string" }
 ]
}

Can we keep putting the newly serialized data into the same HDFS directory /user/hdfs/person_data/, and will Impala/Hive still work, returning NULL for the city column on old records?


1 Answer


Yes, you can, but every new field you add must specify a default value:

{ "name": "newField", "type": "int", "default":999 }

or be marked as nullable with a null default (a nullable union alone is not enough; Avro schema resolution still requires a default to read old records that lack the field):

{ "name": "newField", "type": ["null", "int"], "default": null }