13
votes

I'm using Spark with Scala to read a specific Hive partition. The table is partitioned by year, month, day, a and b:

scala> spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z'").show

But I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 236 in stage 0.0 failed 4 times, most recent failure: Lost task 236.3 in stage 0.0 (TID 287, server, executor 17): org.apache.hadoop.security.AccessControlException: Permission denied: user=user, access=READ, inode="/path-to-table/table/year=2019/month=2/day=27/a=w/b=x/part-00002":user:group:-rw-rw----

As you can see, Spark is trying to read a different partition, and I don't have permissions there.

It shouldn't, because I added a filter and that filter matches exactly my partition.

I tried the same query with Hive and it works perfectly (no access problems):

Hive> select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z';

Why is Spark trying to read this partition while Hive doesn't?

Is there a Spark configuration that I am missing?

Edit: More information

Some files were created with Hive; others were copied from another server and pasted to our server with different permissions (we cannot change the permissions), and then the data should have been refreshed.

We are using: Cloudera 5.13.2.1, Hive 1.1.0, Spark 2.3.0, Hadoop 2.6.0, Scala 2.11.8, Java 1.8.0_144.

Show create table

CREATE EXTERNAL TABLE <columns and types>
PARTITIONED BY (`year` int COMMENT '*', `month` int COMMENT '*', `day` int COMMENT '*', `a` string COMMENT '*', `b` string COMMENT '*')
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = '1'
)
STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://path'
TBLPROPERTIES (
 'transient_lastDdlTime' = '1559029332'
)
A couple of details would be helpful: Scala, Spark, Hive versions, exact table DDL, how was the table created (Hive or Spark), how was the data inserted, anything else you feel could be relevant? – Kombajn zbożowy
@Kombajnzbożowy Some files were created with Hive, others were copied from one server and pasted to our server, then they should have refreshed the data. We are using Cloudera 5.13.2.1, Hive 1.1.0, Spark 2.3.0, Hadoop 2.6.0, Scala 2.11.8 and Java 1.8.0_144 – Pablo López Gallego
@Kombajnzbożowy What is the file format of the underlying table? What are the values of the following configs: "spark.sql.hive.metastorePartitionPruning", "spark.sql.hive.convertMetastoreParquet", "spark.sql.hive.convertMetastoreOrc"? Also, can you paste the output of "show create table db.table"? The partition pruning logic for Hive tables in Spark depends on the above-mentioned configs. – moriarty007
@prakharjain, thanks, this is the problem; you need to set spark.sql.hive.convertMetastoreParquet=false. Write the answer and I will validate it. – Pablo López Gallego

3 Answers

6
votes

A Parquet Hive table in Spark can be read through one of the following two flows:

  1. Hive flow - This is used when spark.sql.hive.convertMetastoreParquet is set to false. For partition pruning to work in this case, you also have to set spark.sql.hive.metastorePartitionPruning=true (see the sketch after this list).

    spark.sql.hive.metastorePartitionPruning: When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).

  2. Datasource flow - This flow has partition pruning turned on by default.
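
For example, a minimal spark-shell sketch of flow 1, reusing the db.table name and partition values from the question (the same configs can also be passed with --conf when launching spark-shell):

// Force the Hive read flow instead of converting the table to the built-in Parquet datasource
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
// Push the partition predicates down to the Hive metastore so only matching partitions are listed and read
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z'").show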

1
votes

This can happen when the metastore does not have the partition values registered for the partition columns. Try running the following from Spark:

ALTER TABLE db.table RECOVER PARTITIONS

And then rerun the same query.
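
A sketch of both steps from spark-shell, again using the db.table name and partition values from the question:

// Re-register the partition directories found on HDFS in the metastore
spark.sql("ALTER TABLE db.table RECOVER PARTITIONS")
spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' and b='z'").show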

-1
votes

You will not be able to read a specific partition of a table through the Spark-Hive API if you don't have access to all of its partitions. Spark uses the Hive table's access permissions, and in Hive you need full access to the table.

The reason is that you cannot treat Spark-Hive access like Unix file access. If you need to do this, use spark.read with CSV (or whatever format applies) and read the data file-based instead.

You can simply use spark.read.csv("/path-to-table/table/year=2019/month=2/day=27/a=w/b=x/part-")

If you want to verify my answer, ignore Spark and try to run the same query in the Hive shell; it will not work because of the Hive configuration.