Spark parquet reading error

Question

I am working on a Spark project, Here i had one file which is in parquet format when I try to load this file using java it gives me the below error. But when I loaded the same file in hive with the same path and write a query select * from table_name, so its working fine and data is also coming properely. Please help me regarding this issue.

java.io.IOException: Could not read footer: java.lang.RuntimeException: corrupted file: the footer index is not within the file at org.apache.parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:247) at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:754) at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:743) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:710) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: corrupted file: the footer index is not within the file at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:237) at org.apache.parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:233) at java.util.concurrent.FutureTask.run(FutureTask.java:262)

Mohit.kc Mohit.kc · Accepted Answer · 2018-01-11T13:47:05

You can try below options

1)  sqlContext.read.parquet("path") 

2) sqlContext.read.format(fileFormat)
.option("header", header) // Use first line of all files as header
.option("inferSchema", inferSchema) // Automatically infer data types
.load(source)

If your issue didn't resolved, please post the sample of code.

Spark parquet reading error

1 Answers