I have a Spark project running on a Cloudera VM. The project loads data from a parquet file and then processes that data. Everything works fine locally, but I need to run the project on a school cluster, and there reading the parquet file fails at this line:
DataFrame schemaRDF = sqlContext.parquetFile("/var/tmp/graphs/sib200.parquet");
I get the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/var/tmp/graphs/sib200.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
    at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:248)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$28.apply(ParquetRelation.scala:750)
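One thing worth checking first (an observation from the trace, not a confirmed fix): the `FileStatus` path resolved with the `file:` scheme, i.e. the local filesystem of the node, not HDFS. A quick sketch of how to verify the data is actually where Spark is looking, using the path from the question (the `hdfs` check assumes a Hadoop client is installed on the cluster node):

```shell
# Check whether the parquet directory exists on the node's local filesystem
# (the trace shows a file: URI, so this is where Spark looked):
ls -d /var/tmp/graphs/sib200.parquet 2>/dev/null || echo "not on local FS"

# If the data was meant to live on HDFS instead, check there too:
hdfs dfs -ls /var/tmp/graphs/sib200.parquet 2>/dev/null || echo "not on HDFS"
```

If the directory is missing on the cluster node, the failure is a deployment problem rather than a version problem.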
From searching online, this looks like a Parquet version mismatch.
What I would like to know is how to find the Parquet version installed on a machine, so I can check whether both environments use the same version. And if you know the exact solution to this error, even better!
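To answer the version question, a common approach is to locate the Parquet jars that Spark/Hadoop ship with: the version is embedded in the jar file name. This is a sketch assuming a typical Linux/Cloudera layout; the directories shown are assumptions and may differ on your install:

```shell
# Search likely install locations for Parquet jars; the version is in the name,
# e.g. parquet-hadoop-1.5.0.jar -> version 1.5.0
find /usr/lib /opt -name 'parquet-*.jar' 2>/dev/null || true

# On a Cloudera parcel install the jars typically sit under the CDH parcel
# directory (assumed path):
ls /opt/cloudera/parcels/CDH/jars/ 2>/dev/null | grep parquet || true

# Extract the version number from a jar name (the name here is an example):
echo 'parquet-hadoop-1.5.0.jar' | sed -E 's/parquet-hadoop-([0-9.]+)\.jar/\1/'
```

Running this on both the Cloudera VM and the school cluster and comparing the version numbers would tell you whether the two environments agree.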
For .csv you should specify format="com.databricks.spark.csv" while reading. – pnv