2
votes

I read a parquet with :

df = spark.read.parquet(file_name)

And get the columns with:

df.columns

which returns a list of column names: ['col1', 'col2', 'col3']

I have read that the Parquet format is able to store some metadata in the file.

Is there a way to store and read extra metadata, for example, to attach a human-readable description of what each column is?

Thanks.

1
It looks like this is just how the Parquet file was persisted (with no header, hence 'col1', etc.). I'd check that first. By default Parquet stores column names and types. - mrjoseph

1 Answer

3
votes

There is no way to read or store arbitrary additional metadata in a Parquet file.

When metadata in a Parquet file is mentioned, it refers to the technical metadata associated with each field, including the number of nested fields, type information, length information, etc. If you look at the SchemaElement class in the Parquet documentation (https://static.javadoc.io/org.apache.parquet/parquet-format/2.6.0/org/apache/parquet/format/SchemaElement.html), you will find all of the metadata available for each field in a schema. This does not include any human-readable description beyond the field name.

A good overview of the Parquet metadata can be found in the "File Format" section here: https://parquet.apache.org/documentation/latest/