I have a Spark project running on a Cloudera VM. The project loads data from a parquet file and then processes that data. Everything works fine locally, but I need to run the project on a school cluster, and there reading the parquet file fails at this line:
DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");
I get the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/var/tmp/graphs/sib200.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:750)
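One thing worth checking first (an observation from the trace, not a confirmed fix): the `FileStatus` path resolved with the `file:` scheme, i.e. the local filesystem of the node, not HDFS. A quick sketch of how to verify the data is actually where Spark is looking, using the path from the question (the `hdfs` check assumes a Hadoop client is installed on the cluster node):

```shell
# Check whether the parquet directory exists on the node's local filesystem
# (the trace shows a file: URI, so this is where Spark looked):
ls -d /var/tmp/graphs/sib200.parquet 2>/dev/null || echo "not on local FS"

# If the data was meant to live on HDFS instead, check there too:
hdfs dfs -ls /var/tmp/graphs/sib200.parquet 2>/dev/null || echo "not on HDFS"
```

If the directory is missing on the cluster node, the failure is a deployment problem rather than a version problem.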
From searching online, this looks like a Parquet version mismatch.
What I would like to know is how to find the Parquet version installed on a machine, so I can check whether both environments use the same version. And if you know the exact solution to this error, even better!
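To answer the version question, a common approach is to locate the Parquet jars that Spark/Hadoop ship with: the version is embedded in the jar file name. This is a sketch assuming a typical Linux/Cloudera layout; the directories shown are assumptions and may differ on your install:

```shell
# Search likely install locations for Parquet jars; the version is in the name,
# e.g. parquet-hadoop-1.5.0.jar -> version 1.5.0
find /usr/lib /opt -name 'parquet-*.jar' 2>/dev/null || true

# On a Cloudera parcel install the jars typically sit under the CDH parcel
# directory (assumed path):
ls /opt/cloudera/parcels/CDH/jars/ 2>/dev/null | grep parquet || true

# Extract the version number from a jar name (the name here is an example):
echo 'parquet-hadoop-1.5.0.jar' | sed -E 's/parquet-hadoop-([0-9.]+)\.jar/\1/'
```

Running this on both the Cloudera VM and the school cluster and comparing the version numbers would tell you whether the two environments agree.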
For .csv you should specify format="com.databricks.spark.csv" while reading. – pnv