Spark SQL read from Hive ORC partitioned table gives array out of bounds exception

Question

I have created a partitioned ORC table in Hive. The data is loaded into HDFS in ORC format using Apache Pig, and the Hive table is then created on top of it. The partition columns are year, month and day. When I try to read that table using Spark SQL, I get an array out of bounds exception. The code and error message are below.

Code:

myTable = spark.table("testDB.employee")
myTable.count()

Error:

ERROR Executor: Exception in task 8.0 in stage 10.0 (TID 66) java.lang.IndexOutOfBoundsException: toIndex = 47

The data types in this table are string, timestamp and double. When I try to select all the columns with a Spark SQL select statement, I get a class cast exception, as shown below.

py4j.protocol.Py4JJavaError: An error occurred while calling o536.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 84, localhost, executor driver): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

After this I tried casting to timestamp using the snippet below, but I still get the array out of bounds exception.

df2 = df.select('dt', unix_timestamp('dt', "yyyy-MM-dd HH:mm:ss").cast(TimestampType()).alias("timestamp"))

What version of Spark are you using? Check this: jira.apache.org/jira/browse/SPARK-24472 — Ravi
Older Spark versions have this same bug; I attached the link to the ticket. Try upgrading to Spark 2.4. — Ravi
I am able to read from another table, which is also an ORC table and has only one partition. I face this issue only with this table, which has many partitions based on day. So could the issue be due to the partitions in the table? — Amrutha K
Possibly, yes. When you say one partition, that means there are actually no partitions: a single directory holds all your data. A table with multiple partitions will have many directories. — Ravi
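
To illustrate the point about directories, here is a minimal sketch of how a table partitioned by year, month and day is typically laid out on disk, assuming the default Hive `column=value` directory naming. The table root path and the `partition_dir` helper are hypothetical, and the real directories may use zero-padded values or custom locations depending on how the data was written.

```python
from datetime import date


def partition_dir(table_root: str, d: date) -> str:
    """Return the directory that would hold one day's partition.

    Assumes the default Hive layout (column=value) and unpadded
    numeric partition values; both are assumptions, not facts
    taken from the question.
    """
    return f"{table_root}/year={d.year}/month={d.month}/day={d.day}"


# Hypothetical table root, used only for illustration.
root = "/warehouse/testDB.db/employee"
for d in (date(2019, 7, 23), date(2019, 7, 24)):
    print(partition_dir(root, d))
```

Each distinct (year, month, day) combination gets its own directory, so a table partitioned by day can easily have hundreds of them, whereas an unpartitioned table has just one.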

Wenrui Meng · Accepted Answer · 2019-07-24T23:40:31

If you don't specify a partition filter, it could cause this problem. On my side, specifying a date between filter resolved the out of bounds exception.
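
As a rough sketch of what such a filter could look like for the table in the question, the helper below builds a predicate over the year, month and day partition columns for a date range. The `partition_filter` function is hypothetical, and it assumes the partition columns hold unpadded numeric values; if they are strings such as '07', the values would need quoting and padding.

```python
from datetime import date, timedelta


def partition_filter(start: date, end: date) -> str:
    """Build a SQL predicate matching every day partition in [start, end].

    Assumes partition columns named year, month and day with numeric
    values (an assumption based on the question, not verified).
    """
    if end < start:
        raise ValueError("end must not be before start")
    clauses = []
    current = start
    while current <= end:
        clauses.append(
            f"(year = {current.year} AND month = {current.month} "
            f"AND day = {current.day})"
        )
        current += timedelta(days=1)
    return " OR ".join(clauses)


predicate = partition_filter(date(2019, 7, 1), date(2019, 7, 3))
print(predicate)

# With an active SparkSession named `spark` (not created here), the
# predicate could be applied before any action is triggered:
# my_table = spark.table("testDB.employee").where(predicate)
# my_table.count()
```

The predicate uses only equality checks on partition columns, so it stays simple to reason about; for long date ranges a different formulation may be more practical.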
