
I have a scenario where Spark infers the schema from the input file and writes Parquet files with integer data types.

But we have tables in Hive where the fields are defined as BigInt. Right now there is no conversion from int to long, and Hive throws errors that it cannot cast Integer to Long. I cannot change the Hive DDL to integer data types, as it is a business requirement to keep those fields as Long. I have looked at the option of casting the data types before saving. This can be done, except that I have hundreds of columns and explicit casts make the code very messy.

Is there a way to tell Spark to auto-cast the data types?


1 Answer


Since Spark version 1.4 you can apply the cast method with a DataType on the column:

Suppose the DataFrame df has a column year of type Long:

import org.apache.spark.sql.types.IntegerType

// Cast into a temporary column, then drop the original and rename.
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
    .drop("year")
    .withColumnRenamed("yearTmp", "year")

If you are using SQL expressions, you can also do:

val df2 = df.selectExpr("cast(year as int) year", 
                        "make", 
                        "model", 
                        "comment", 
                        "blank")

For more info, check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame