How to specify column description in Parquet schema definition

Question

I am using Cascading to convert delimited text files to Parquet and Avro files. I am not able to provide descriptions for columns in the Parquet metadata the same way Avro allows. Having them would be helpful, because anyone using the data set could get a description of each field from the data set itself.

Below is the Parquet Schema:

message LaunchApplication {
   required int32 field1;
   required binary field2;
   optional binary field3;
   required binary field4;
 }

Below is the Avro schema:

{ "type":"record", "name":"CascadingAvroSchema", "namespace":"", "fields":[
  {"name":"field1","type":"int","doc":"10,NOT NULL, KeyField"},
  {"name":"field2","type":"string","doc":"5,NOT NULL, FLAG, Indicator"},
  {"name":"field3","type":["null","string"],"doc":"20,NULL, System Field."},
  {"name":"field4","type":"string","doc":"20,NOT NULL,MM/DD/YYYY,Record Changed Date."}  ]
}

How do I keep track of the "doc" section of the Avro schema in Parquet as well?

Zoltan Zoltan · Accepted Answer · 2018-09-21T15:46:36

Actually, Parquet supports Avro schemas as well. If you write the file with an Avro schema, Parquet infers the Parquet schema from it and also stores the full Avro schema, "doc" attributes included, in the file metadata.
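
As a rough illustration of what that buys you: once the Avro schema JSON has been read back from the Parquet footer metadata, the column descriptions can be recovered with a plain JSON parse. The sketch below uses only the Python standard library and works on the schema string from the question. It does not read a Parquet file. `column_docs` is a hypothetical helper name, and the claim that the string lives under the key `parquet.avro.schema` in the footer's key-value metadata is an assumption based on parquet-avro's usual behavior, so verify it against the library version you use.

```python
import json

# The Avro schema from the question, as the JSON string you would expect to
# get back from the Parquet footer's key-value metadata (assumed key:
# "parquet.avro.schema"). Reading it out of the file is not shown here.
AVRO_SCHEMA_JSON = """
{ "type":"record", "name":"CascadingAvroSchema", "namespace":"", "fields":[
  {"name":"field1","type":"int","doc":"10,NOT NULL, KeyField"},
  {"name":"field2","type":"string","doc":"5,NOT NULL, FLAG, Indicator"},
  {"name":"field3","type":["null","string"],"doc":"20,NULL, System Field."},
  {"name":"field4","type":"string","doc":"20,NOT NULL,MM/DD/YYYY,Record Changed Date."} ]
}
"""


def column_docs(avro_schema_json):
    """Map each top-level field name to its "doc" string ("" if absent).

    Hypothetical helper: handles only a flat record schema like the one
    in the question, not nested records.
    """
    schema = json.loads(avro_schema_json)
    return {f["name"]: f.get("doc", "") for f in schema.get("fields", [])}


if __name__ == "__main__":
    for name, doc in column_docs(AVRO_SCHEMA_JSON).items():
        print(f"{name}: {doc}")
```

The Parquet message type itself has no per-column description slot, which is why the descriptions have to travel in the embedded Avro schema rather than in the Parquet schema.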
