3
votes

While reading a Parquet file stored in Hadoop with either Scala or PySpark, the following error occurs:

// scala
var dff = spark.read.parquet("/super/important/df")

org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
  ... 52 elided

or

sql_context.read.parquet(output_file)

results in the same error.

The error message is pretty clear about what has to be done: Unable to infer schema for Parquet. It must be specified manually. But where can I specify it?

Spark 2.1.1, Hadoop 2.5; the DataFrames are created with the help of PySpark. The files are partitioned into 10 pieces.

2
Can you try this: var dff = spark.read.parquet("/super/important/df").toDF("ColumnName1", "ColumnName2") - Bhavesh

2 Answers

1
votes

I have done a quick implementation for the same.

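A minimal sketch of specifying the schema manually on read, in Scala; the column names and types here are hypothetical and must be replaced with the actual schema of the data:

    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Hypothetical schema; replace the fields with the real columns of the data.
    val schema = StructType(Seq(
      StructField("ColumnName1", StringType, nullable = true),
      StructField("ColumnName2", IntegerType, nullable = true)
    ))

    // Passing the schema explicitly skips inference entirely.
    val dff = spark.read.schema(schema).parquet("/super/important/df")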

Hope this helps!

1
votes

This error usually occurs when you try to read an empty directory as Parquet. If, for example, you create an empty DataFrame, write it as Parquet, and then read it back, this error appears. You could check whether the DataFrame is empty with rdd.isEmpty() before writing it, as in the sketch below.
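A minimal sketch of that guard in Scala, assuming a DataFrame df and an output path:

    // Only write the DataFrame if it actually contains rows;
    // reading back an empty Parquet directory triggers the schema-inference error.
    if (!df.rdd.isEmpty()) {
      df.write.parquet("/super/important/df")
    }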