I have created a partitioned ORC table in Hive. The data was loaded into HDFS in ORC format using Apache Pig, and the Hive table was then created on top of it. The partition columns are year, month, and day. When I try to read the table using Spark SQL, I get an array-out-of-bounds exception. Please find the code and error message below.
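For reference, the table definition looks roughly like this (a sketch only; apart from testDB.employee and the dt column used later, the column names and HDFS location are my own placeholders, and the real table was created directly in Hive):

# A minimal sketch of the assumed setup, expressed through spark.sql
# for readability; 'name', 'salary', and the LOCATION are hypothetical.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS testDB.employee (
        name   STRING,
        dt     TIMESTAMP,
        salary DOUBLE
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS ORC
    LOCATION '/user/data/employee'
""")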
Code:
myTable = spark.table("testDB.employee")
myTable.count()
Error:
ERROR Executor: Exception in task 8.0 in stage 10.0 (TID 66) java.lang.IndexOutOfBoundsException: toIndex = 47
The data types in this table are string, timestamp, and double. When I try to select all the columns with a Spark SQL query, I get a class cast exception, shown below.
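Code (a reconstruction, since I no longer have the exact statement; the .show() call matches the showString in the stack trace):

df = spark.sql("SELECT * FROM testDB.employee")
df.show()

Error: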
py4j.protocol.Py4JJavaError: An error occurred while calling o536.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 84, localhost, executor driver): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
After this I tried casting the column to a timestamp using the snippet below, but I still get the array-out-of-bounds exception.
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import TimestampType

# df is the DataFrame from the select above; parse the string column
# 'dt' and cast the resulting epoch seconds to a proper timestamp.
df2 = df.select('dt', unix_timestamp('dt', "yyyy-MM-dd HH:mm:ss").cast(TimestampType()).alias("timestamp"))
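To isolate the cast expression from the ORC read path, I would expect it to behave like this on a small in-memory DataFrame (a diagnostic sketch, not part of the original job):

from pyspark.sql import SparkSession
from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample row in the assumed yyyy-MM-dd HH:mm:ss format;
# if this works while the table read fails, the problem is in the
# ORC/Hive read path rather than the cast itself.
sample = spark.createDataFrame([("2018-01-15 10:30:00",)], ["dt"])
sample.select(unix_timestamp("dt", "yyyy-MM-dd HH:mm:ss").cast(TimestampType()).alias("ts")).show()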