1
votes

The Hive table was created with a partition column and 4 buckets:

CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets 

The following line of Spark code inserts data into this table:

 hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")

and in spark-defaults.conf, the default parallelism is set to 128:

spark.default.parallelism=128

The problem is that when the inserts happen, each partition of the Hive table ends up with 128 files instead of 4 buckets. spark.default.parallelism cannot be reduced to 4 because that makes the whole system very slow, and calling DataFrame.coalesce before the write makes the inserts too slow as well (a sketch of that attempt is shown below).
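For reference, a minimal sketch of the coalesce workaround described above; it mirrors the write call from the question and is not the exact code that was run:

 // Shrinking to 4 tasks before the write yields 4 files per Hive partition,
 // but it funnels all of the data through those 4 tasks, which is why the insert slows down.
 hourlies.coalesce(4)
   .write
   .partitionBy("traffic_date_hour")
   .insertInto("hourly_suspect")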

Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?

1
More insight is needed, specifically the code behind hourlies.write. But if I have to guess, you are trying to save the data without using collect or take, so each executor writes its own output to HDFS, making it 128 files (which is your parallelism). - Abhishek Anand
That is correct: the executors are writing to HDFS using the default parallelism. hourlies is a DataFrame and write is a function defined on the DataFrame. - sparkDabbler

1 Answer

2
votes

As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables through spark-sql. When a bucketed table is created, there should be a CLUSTERED BY clause on one of the columns from the table schema, and I don't see that in the CREATE TABLE statement you posted. Assuming that it does exist and you know the clustering column, you could use the bucketBy method of the DataFrameWriter API.
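A minimal sketch of what that could look like, assuming the table is clustered by the cells column (a hypothetical choice; substitute the column your table is actually CLUSTERED BY). Note that bucketBy only takes effect with saveAsTable, not with insertInto:

 // Hypothetical sketch: write the DataFrame as a partitioned, bucketed table.
 // "cells" stands in for whatever column the table is clustered by.
 hourlies.write
   .partitionBy("traffic_date_hour")
   .bucketBy(4, "cells")
   .format("orc")
   .saveAsTable("hourly_suspect")

Keep in mind that tables bucketed this way use Spark's own bucketing layout, which Hive does not necessarily recognize as bucketed.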

More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html