3 votes

I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run the Hive query it shows all records and partitions.

However, when I use the same table in Spark (where the Spark SQL has a where condition on the partition column), the plan does not show that a partition filter is applied. For a Hive-managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.

Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.
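
A minimal sketch of the Spark side, with a placeholder table name (movies_ext) and rating as the partition column (taken from the update below):

    // Placeholder table name; rating is the partition column of the external table.
    // The physical plan for this query does not show a "PartitionFilters" entry,
    // while the equivalent query on the Hive-managed table does.
    spark.sql("SELECT * FROM movies_ext WHERE rating = 0").explain(true)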


Update:

For some reason, only the Spark plan is not showing the partition filters. However, when you look at the data loaded, it is only loading the data needed from the partitions.

For example, with where rating=0 it loads only one file of 1 MB; without the filter it reads all 3 partitions for 3 MB.
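
A rough sketch (assuming Spark 2.x-style plans with adaptive execution off, and a placeholder table name) of how the per-scan metrics can be checked after running the query:

    // Placeholder table name; metric names can vary between Spark versions.
    val df = spark.sql("SELECT * FROM movies_ext WHERE rating = 0")
    df.collect() // small sample data, so collecting is fine here

    // The leaf of the executed plan is the scan; for a native file scan its
    // metrics include the number of files and bytes actually read.
    val scan = df.queryExecution.executedPlan.collectLeaves().head
    scan.metrics.foreach { case (name, m) => println(s"$name = ${m.value}") }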


what happens if you replace "where movieid in (10,12)" with "where rating=0" in the query to the hive-managed table? I have a suspicion that "PartitionFilters: ..." will disappear in that case. Probably the filter is not invoked if you include only a specific partition in the "where" clause – mangusta
ah, the partition in the hive-managed table is "movieid", not "rating", right? my reasoning was wrong then – mangusta
@mangusta I understand the confusion; will update the question with details. – Dave
Can you try the where clause value as a string instead of an int, like where rating='0'? – SMaZ
A little unclear here – thebluephantom

2 Answers

2 votes

tl;dr: set the following before running the SQL against the external table: spark.sql("set spark.sql.hive.convertMetastoreOrc=true")

The difference in behaviour is not because of the external/managed distinction.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)

Where the table was created (Hive or Spark)

If the table was created using Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both these tables is stored in the Hive metastore; the only difference is in the provider field of the TBLPROPERTIES of the tables (describe extended <tblName>). The value of the property is orc or empty for a Spark table and hive for a Hive native table.
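
A quick way to check this (sketch with a placeholder table name) is to look at the Provider row in the extended description:

    // Placeholder table name. The "Provider" row typically shows e.g. orc/parquet
    // for a table created through Spark APIs and hive for a HiveQL-created table.
    spark.sql("DESCRIBE EXTENDED movies_ext")
      .filter("col_name = 'Provider'")
      .show(truncate = false)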

How Spark uses this information

When the provider is not hive (datasource table), Spark uses its native way of processing the data.
If the provider is hive, Spark uses Hive code to process the data.

File format

Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet. The flags are:

Orc

  val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
    .doc("When set to true, the built-in ORC reader and writer are used to process " +
      "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
    .booleanConf
    .createWithDefault(true)

Parquet

  val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
    .doc("When set to true, the built-in Parquet reader and writer are used to process " +
      "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
    .booleanConf
    .createWithDefault(true)
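
For example (sketch only, with a placeholder table name), enabling the conversion and re-checking the plan should make the scan appear as a native FileScan, which is the node that carries the PartitionFilters entry:

    // Enable the built-in ORC reader for HiveQL-created ORC tables
    // (for Parquet-backed tables use spark.sql.hive.convertMetastoreParquet).
    spark.sql("set spark.sql.hive.convertMetastoreOrc=true")

    // Placeholder table name; with the conversion enabled, the physical plan
    // should show a FileScan with PartitionFilters instead of a HiveTableScan.
    spark.sql("SELECT * FROM movies_ext WHERE rating = 0").explain(true)
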
-1 votes

I also ran into this kind of problem when joining multiple internal and external tables.

None of the tricks worked, including:

    spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
    spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
    spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")

Does anyone know how to solve this problem?