36
votes
response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

in

/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)

The documentation for Parquet says the format is self-describing, and the full schema was available when the parquet file was saved. What gives?

Using Spark 2.1.1. Also fails in 2.2.0.

Found this bug report, but it was fixed in 2.0.1 and 2.1.0.

UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".


14 Answers

57
votes

This error usually occurs when you try to read an empty directory as parquet. Probably your outcome DataFrame is empty.

You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
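
For example, a minimal sketch of that guard, reusing the outcome DataFrame and response path from the question:

# Sketch: only write the parquet output when the DataFrame actually has rows;
# an empty result leaves a directory with no data files, which cannot be read back.
if outcome.rdd.isEmpty():
    print("outcome is empty, nothing to write")
else:
    outcome.write.parquet(response, mode="overwrite")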

9
votes

In my case, the error occurred because I was trying to read a parquet file which started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.

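A likely explanation is that Spark's file listing treats files whose names start with _ or . as hidden metadata files (such as _SUCCESS) and skips them, so the directory appears to contain no data. A minimal sketch of the rename workaround, with hypothetical local paths:

# Sketch: rename the underscore-prefixed file so Spark's file listing no longer
# skips it as a hidden/metadata file. Paths are hypothetical.
import os

os.rename("/data/_lots_of_data.parquet", "/data/lots_of_data.parquet")
df = spark.read.parquet("/data/lots_of_data.parquet")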

7
votes

I'm using AWS Glue and I received this error while reading data from a Data Catalog table (location: S3 bucket). After a bit of analysis I realised that this was because the file was not available at the expected location (in my case, the S3 bucket path).

Glue was trying to apply the Data Catalog table schema to a file which didn't exist.

After copying the file into the S3 bucket location, the issue was resolved.
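
A quick way to confirm this cause is to list the catalog table's S3 location before reading; a minimal sketch with boto3, where the bucket and prefix are placeholders:

# Sketch: check that the Data Catalog table's S3 location actually contains objects.
# Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="path/to/table/")
if resp.get("KeyCount", 0) == 0:
    raise ValueError("No files found at the catalog table's S3 location")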

Hope this helps someone who encounters this error in AWS Glue.

5
votes

This happens when you try to read a table that is empty. If data had been correctly inserted into the table, there would be no problem.

Besides Parquet, the same thing happens with ORC.
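
For instance, a minimal repro sketch (the path and schema are made up; depending on the Spark version, the write below may leave a directory with no data files, and the read then fails the same way, also when using orc instead of parquet):

# Sketch: write an empty DataFrame, then try to read it back.
# On older Spark releases this read may fail with
# "Unable to infer schema for Parquet. It must be specified manually."
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([StructField("id", IntegerType(), True)])
empty = spark.createDataFrame([], schema)
empty.write.parquet("/tmp/empty_table", mode="overwrite")
spark.read.parquet("/tmp/empty_table")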

4
votes

Just to emphasize @Davos's answer in a comment: you will encounter this exact exception if your file name has a dot . or an underscore _ at the start of the filename.

val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
         .load("/Users/myuser/_HEADER_0")

org.apache.spark.sql.AnalysisException: 
Unable to infer schema for CSV. It must be specified manually.;

The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).

2
votes

In my case, the error occurred because the filename contained underscores. Rewriting / reading the file without underscores (hyphens were OK) solved the problem...

2
votes

I see there are already many answers, but the issue I faced was that my Spark job was trying to read a file that was being overwritten by another Spark job started earlier. It sounds bad, but I made that mistake.

2
votes

For me this happened when I thought I was loading the correct file path but was instead pointing at an incorrect folder.

1
votes

I ran into a similar problem with reading a csv

spark.read.csv("s3a://bucket/spark/csv_dir/.")

gave an error of:

org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;

I found that if I removed the trailing . it works, i.e.:

spark.read.csv("s3a://bucket/spark/csv_dir/")

I tested this for Parquet by adding a trailing . and you get the error:

org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

0
votes

As others mentioned, in my case this error appeared when I was reading S3 keys that did not exist. A solution is to filter in the keys that do exist:

import com.amazonaws.services.s3.AmazonS3URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI

def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String = {
  val uri = new URI(url)
  val hostWithEndpoint = uri.getHost + "." + domain
  new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment).toString
}

def createS3URI(url: String): AmazonS3URI = {
  try {
    // try to instantiate AmazonS3URI with url
    new AmazonS3URI(url)
  } catch {
    case e: IllegalArgumentException if e.getMessage.
      startsWith("Invalid S3 URI: hostname does not appear to be a valid S3 endpoint") => {
      new AmazonS3URI(addEndpointToUrl(url))
    }
  }
}

def s3FileExists(spark: SparkSession, url: String): Boolean = {
  val amazonS3Uri: AmazonS3URI = createS3URI(url)
  val s3BucketUri = new URI(s"${amazonS3Uri.getURI().getScheme}://${amazonS3Uri.getBucket}")

  FileSystem
    .get(s3BucketUri, spark.sparkContext.hadoopConfiguration)
    .exists(new Path(url))
}

and you can use it as:

val partitions = List(yesterday, today, tomorrow)
  .map(f => somepath + "/date=" + f)
  .filter(f => s3FileExists(spark, f))

val df = spark.read.parquet(partitions: _*)

For that solution I took some code from the spark-redshift project, here.

0
votes

You are just loading a parquet file, and of course the parquet data had a valid schema; otherwise it would not have been saved as parquet. This error means:

  • Either the parquet file does not exist (in 99.99% of cases this is the issue; Spark error messages are often less obvious). A quick existence check is sketched below.
  • Somehow the parquet file got corrupted, or it is not a parquet file at all.
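
To rule out the first case, here is a minimal sketch that checks the path before reading; the path is a placeholder, and it goes through PySpark's internal JVM gateway (spark._jsc, spark._jvm), which is not a public API, so treat it as an assumption:

# Sketch: verify the path exists via the Hadoop FileSystem API before reading.
# Uses PySpark internals; the path below is a placeholder.
hadoop_conf = spark._jsc.hadoopConfiguration()
target = spark._jvm.org.apache.hadoop.fs.Path("hdfs://host:port/parquetfolder")
fs = target.getFileSystem(hadoop_conf)
if not fs.exists(target):
    raise IOError("Parquet path does not exist: " + str(target))
df = spark.read.parquet("hdfs://host:port/parquetfolder")
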
0
votes

I ran into this issue because of a folder-within-a-folder problem.

For example, folderA.parquet was supposed to contain the partitions, but instead it contained folderB.parquet, which in turn held the partitions.

Resolution: move the files to the parent folder and delete the subfolder.
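
A minimal sketch of that cleanup on a local filesystem; the paths are hypothetical:

# Sketch: move the data files from the nested folder up into the parent folder,
# then remove the now-empty subfolder. Paths are hypothetical.
import os
import shutil

parent = "/data/folderA.parquet"
nested = os.path.join(parent, "folderB.parquet")
for name in os.listdir(nested):
    shutil.move(os.path.join(nested, name), parent)
os.rmdir(nested)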

0
votes

I just encountered the same problem, but none of the solutions here worked for me. I tried to merge the row groups of my parquet files on HDFS by first reading them and writing them to another place using:

df = spark.read.parquet('somewhere')
df.write.parquet('somewhere else')

But later when I query it with

spark.sql('SELECT sth FROM parquet.`hdfs://host:port/parquetfolder/` WHERE .. ')

it shows the same problem. I finally solved this by using pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

df = spark.read.parquet('somewhere')
pdf = df.toPandas()                  # collect to the driver as a pandas DataFrame
adf = pa.Table.from_pandas(pdf)      # convert to an Arrow table
fs = pa.hdfs.connect()               # connect to HDFS
with fs.open(path, 'wb') as fw:      # path: destination parquet file on HDFS
    pq.write_table(adf, fw)          # write the Arrow table as a parquet file

0
votes

You can read it with a /* glob:

outcome2 = sqlc.read.parquet(f"{response}/*")  # works for me