11
votes

Is there any way to append a new column to an existing parquet file?

I'm currently working on a kaggle competition, and I've converted all the data to parquet files.

Here is the case: I read the parquet file into a PySpark DataFrame, did some feature extraction, and appended new columns to the DataFrame with pyspark.sql.DataFrame.withColumn().

After that, I want to save the new columns in the source parquet file.
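
Roughly what I am doing (the path and column names below are just placeholders):

# read the parquet data I converted earlier
df = sqlContext.read.parquet('data/train.parquet')

# feature extraction: derive a new column with withColumn()
df = df.withColumn('feat_squared', df['feat'] * df['feat'])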

I know Spark SQL comes with Parquet schema evolution, but the example only shows a key-value case.
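
For reference, the schema-merging example I am referring to looks roughly like this (each schema variant is written under its own key= directory, so the new column only exists for newly written rows, not for the rows that are already there):

df1 = sqlContext.createDataFrame([(1, 2)], ['single', 'double'])
df1.write.parquet('data/test_table/key=1')

df2 = sqlContext.createDataFrame([(3, 9)], ['single', 'triple'])
df2.write.parquet('data/test_table/key=2')

# reading the whole directory merges the two schemas
merged = sqlContext.read.option('mergeSchema', 'true').parquet('data/test_table')
merged.printSchema()  # single, double, triple, key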

The parquet "append" mode doesn't do the trick either; it only appends new rows to the parquet file. Is there any way to append a new column to an existing parquet file instead of regenerating the whole table? Or do I have to generate a separate new parquet file and join them at runtime?
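
The runtime-join workaround I have in mind would be something like this (again, the paths and the join key are placeholders):

# write only the key and the derived columns to a second file
df.select('id', 'feat_squared').write.parquet('data/train_features.parquet')

# later, at runtime, stitch the original and the new columns back together
full = sqlContext.read.parquet('data/train.parquet') \
    .join(sqlContext.read.parquet('data/train_features.parquet'), 'id')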

3
Architecturally speaking, appending a new column to an existing parquet file cannot be done; it would amount to playing around with the metadata of the parquet file. - Aviral Kumar
Although you can try to rewrite it by first changing the schema. I am not very sure, though, how this happens in spark-sql. - Aviral Kumar
Yeah, changing the schema in spark-sql is easy, but overwriting the whole parquet file is costly, which means I have to recompute the whole table again. Thanks for your comment, @AviralKumar - Chu-Yu Hsu

3 Answers

6
votes

In parquet you don't modify files in place: you read them, modify the data, and write them back. You cannot change just one column; you need to read and write the full file.
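
In practice that means something like this (a minimal sketch; the path and the new column are only examples):

# read the whole file, derive the extra column, and write everything out again
df = spark.read.parquet('/data/table.parquet')
df = df.withColumn('new_col', df['old_col'] * 2)

# write to a new location; overwriting the path you are still reading from is unsafe
df.write.mode('overwrite').parquet('/data/table_v2.parquet')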

5
votes

Although this question was posted two years ago and still has no accepted answer, let me answer my own question.

At the time I was still working with Spark, the version was 1.4. I don't know about newer versions, but in that version, adding a new column to an existing parquet file was impossible.

0
votes

Yes, it is possible with both Databricks Delta and parquet tables. An example is given below.

This example is written in Python (PySpark):

df = sqlContext.createDataFrame([('1','Name_1','Address_1'),('2','Name_2','Address_2'),('3','Name_3','Address_3')], schema=['ID', 'Name', 'Address'])

delta_tblNm = 'testDeltaSchema.test_delta_tbl'
parquet_tblNm = 'testParquetSchema.test_parquet_tbl'

delta_write_loc = 'dbfs:///mnt/datalake/stg/delta_tblNm'
parquet_write_loc = 'dbfs:///mnt/datalake/stg/parquet_tblNm'


# DELTA TABLE
df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(delta_write_loc)
spark.sql(" create table if not exists {} using DELTA LOCATION '{}'".format(delta_tblNm, delta_write_loc))
spark.sql("refresh table {}".format(print(cur_tblNm)))

# PARQUET TABLE
df.write.format("parquet").mode("overwrite").save(parquet_write_loc)
spark.sql("""CREATE TABLE if not exists {} USING PARQUET LOCATION '{}'""".format(parquet_tblNm, parquet_write_loc))
spark.sql(""" REFRESH TABLE {} """.format(parquet_tblNm))

test_df = spark.sql("select * testDeltaSchema.test_delta_tbl")
test_df.show()

test_df = spark.sql("select * from testParquetSchema.test_parquet_tbl")
test_df.show()

test_df = spark.sql("ALTER TABLE  testDeltaSchema.test_delta_tbl ADD COLUMNS (Mob_number String COMMENT 'newCol' AFTER Address)")
test_df.show()

test_df = spark.sql("ALTER TABLE  testParquetSchema.test_parquet_tbl ADD COLUMNS (Mob_number String COMMENT 'newCol' AFTER Address)")
test_df.show()