6 votes

I want to store a Spark DataFrame in a Hive table in normal, readable text format. To do so, I first ran:

sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")

My DataFrame is:

final_data1_df = sqlContext.sql("select a, b from final_data")

and I am trying to write it with:

final_data1_df.write.partitionBy("b").mode("overwrite").saveAsTable("eefe_lstr3.final_data1")

but this is very slow, even slower than a plain Hive table write. To work around it, I tried defining the partitions with a Hive DDL statement first and then loading the data:

sqlContext.sql("""
CREATE TABLE IF NOT EXISTS eefe_lstr3.final_data1(
a BIGINT
)
PARTITIONED BY (b INT)
"""
)
sqlContext.sql("""
INSERT OVERWRITE TABLE eefe_lstr3.final_data1 PARTITION (b)
select * from final_data1""")

but while this gives a partitioned Hive table, the data is still in Parquet format. Am I missing something here?

What is the exact error message you are getting? Also, are you sure that your sqlContext = HiveContext(sc) ? - KartikKannapur
Yes, my sqlContext is in fact a HiveContext, and I am not getting any error. In the first case the write is slow; in the second case the data is still Parquet. - abhiieor
Did you find a solution for this by now? - Ram Ghadiyaram
No. I kept trying this, but due to deadline pressures I eventually moved back from Spark to a map-only architecture. - abhiieor

1 Answer

-1 votes

When you create the table explicitly, that DDL determines the table's storage format. Plain text is normally the default in Hive, but the default may have been changed in your environment.
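As a quick check (my addition, assuming a HiveContext as in the question), you can inspect the table's metadata to see which format it actually uses:

# Check the table's actual storage format: the InputFormat, OutputFormat,
# and SerDe rows show whether it is Parquet or plain text.
sqlContext.sql("DESCRIBE FORMATTED eefe_lstr3.final_data1").show(100, False)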

Add "STORED AS TEXTFILE" at the end of the CREATE statement to make sure the table is plain text.