0 votes

I have a table which has some missing partitions. When I query it in Hive it works fine:

SELECT *
  FROM my_table

but when I call it from PySpark (v. 2.3.0) it fails with the message Input path does not exist: hdfs://path/to/partition. The Spark code I am running is straightforward:

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("prueba1")
         .master("yarn")
         .config("spark.sql.hive.verifyPartitionPath", "false")
         .enableHiveSupport()
         .getOrCreate())

spark.table('some_schema.my_table').show(10)

The config("spark.sql.hive.verifyPartitionPath", "false") setting has been proposed in this question, but it does not seem to work for me.

Is there any way to configure the SparkSession so I can get rid of these errors? I am afraid that more partitions will go missing in the future, so a hardcoded solution is not possible.

2
Is your table external? If so, can you try MSCK REPAIR on the Hive table through Hive, or ALTER TABLE to add the partitions? - K S Nidhin

2 Answers

1 vote

This error occurs when partition data is dropped from HDFS directly, i.e. without using Hive commands to drop the partition.

If the data is dropped from HDFS directly, Hive doesn't know about the dropped partition. When we query the Hive table, it still looks for the directory, and since the directory no longer exists in HDFS, this results in a file-not-found exception.

To fix this issue, we need to drop the partition associated with the directory from the Hive table as well, using:

alter table <db_name>.<table_name> drop partition(<partition_col_name>=<partition_value>);
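For example, assuming a hypothetical date partition column dt (the column name and value below are placeholders, not taken from the question), the same statement can be issued from the PySpark session:

# Hypothetical example: `dt` and '2019-01-01' are placeholder values.
spark.sql("ALTER TABLE some_schema.my_table "
          "DROP IF EXISTS PARTITION (dt='2019-01-01')")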

Hive then drops the partition from the metadata. This is the only way to remove the metadata from the Hive table once the partition directory has been dropped from HDFS.

msck repair table doesn't drop partitions; it only adds new partitions when new partition directories appear in HDFS.
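For reference, it can be invoked from the question's session like this; it will register any new partition directories but leave stale metastore entries untouched:

# MSCK REPAIR registers partition directories that exist in HDFS;
# it does not remove entries for directories that were deleted.
spark.sql("MSCK REPAIR TABLE some_schema.my_table")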

The correct way to avoid this kind of issue in the future is to drop partitions using the Hive drop partition command.
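Since the question mentions that more partitions may go missing over time, here is a minimal sketch of how the cleanup could be automated from PySpark. It assumes the default directory layout <table_location>/<partition_spec>, a single string partition column, and Spark's own DESCRIBE FORMATTED output format; the table name is taken from the question, everything else is illustrative:

# Sketch: drop metastore entries for partitions whose HDFS directories
# are gone. Assumes the default layout <table_location>/<partition_spec>
# and a single partition column; adapt for multi-level partitioning.
table = "some_schema.my_table"

# Table root location from the metastore (Spark's DESCRIBE FORMATTED output)
desc = spark.sql("DESCRIBE FORMATTED {}".format(table)).collect()
location = next(r.data_type for r in desc if r.col_name.strip() == "Location")

# Hadoop FileSystem handle via the py4j JVM gateway
jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

for row in spark.sql("SHOW PARTITIONS {}".format(table)).collect():
    spec = row.partition  # e.g. 'dt=2019-01-01'
    if not fs.exists(jvm.org.apache.hadoop.fs.Path(location + "/" + spec)):
        col, val = spec.split("=", 1)
        spark.sql("ALTER TABLE {} DROP IF EXISTS PARTITION ({}='{}')"
                  .format(table, col, val))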

0 votes

Does the other way around, .config("spark.sql.hive.verifyPartitionPath", "true"), work for you? I have just managed to load data using spark-sql with this setting while one of the partition paths from Hive was empty and the partition still existed in the Hive metastore. There is a caveat, though: it seems to take significantly more time to load data compared to when this setting is set to false.
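That is, the same builder as in the question, with only the flag flipped:

from pyspark.sql import SparkSession

# Same session setup as in the question, with verifyPartitionPath set to
# "true" so Spark checks the partition paths under the table root when
# reading, skipping ones that no longer exist in HDFS.
spark = (SparkSession
         .builder
         .appName("prueba1")
         .master("yarn")
         .config("spark.sql.hive.verifyPartitionPath", "true")
         .enableHiveSupport()
         .getOrCreate())

spark.table('some_schema.my_table').show(10)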