3
votes

I have a parquet file that is read by Spark as an external table.

One of the columns is defined as int both in the parquet schema and in the Spark table.

Recently, I've discovered that int is too small for my needs, so I changed the column type to long in the new parquet files. I also changed the type in the Spark table to bigint.

However, when I try to read an old parquet file (with the int column) through the Spark external table (now bigint), I get the following error:

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

One possible solution is altering the column type in the old parquet files to long, which I asked about here: How can I change parquet column type from int to long?, but that is very expensive since I have a lot of data.

Another possible solution is to read each parquet file according to its own schema into a different Spark table and create a union view over the old and new tables (see the sketch below), which is very ugly.
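
To illustrate what I mean, a rough sketch of that union-view workaround might look like this (the paths, view names and column name are just placeholders):

# register each generation of parquet files under its own temp view
spark.read.parquet('/data/old').createOrReplaceTempView('t_old')   # column is still int here
spark.read.parquet('/data/new').createOrReplaceTempView('t_new')   # column is already bigint

# expose one view, casting the old int column up to bigint
spark.sql("""
    CREATE OR REPLACE TEMP VIEW t_all AS
    SELECT CAST(col_name AS BIGINT) AS col_name FROM t_old
    UNION ALL
    SELECT col_name FROM t_new
""")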

Is there another way to read an int column from parquet as long in Spark?

Did you find a solution? - Saurabh7
No. - Dror B.

1 Answer

2
votes

Using PySpark, couldn't you just do

df = spark.read.parquet('path to parquet files')

and then just cast the column type in the DataFrame:

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

# cast the int column up to long (bigint)
new_df = (df
          .withColumn('col_name', col('col_name').cast(LongType()))
         )

and then just save the new DataFrame to the same location with overwrite mode.
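
For example (the path and column name above are placeholders; overwriting the exact path you just read from can be risky with lazy evaluation, so writing to a fresh location first is safer):

(new_df
 .write
 .mode('overwrite')
 .parquet('path to parquet files'))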