
I have a C# application that creates and uploads parquet files to a remote HDFS cluster. If I copy a file using scp to a target machine with an HDFS client installed and then "hdfs put" the file into HDFS, Spark can read the file correctly.

If I upload the file directly to HDFS from the client application using curl against the WebHDFS REST services, I get the following error from Spark when trying to read the parquet file:

df = sqlContext.read.parquet("/tmp/test.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 303, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

If I download both files (the one uploaded via scp and the one uploaded via curl) to the local filesystem and run a binary diff between them, the diff shows no difference. And if I take the file that was uploaded via curl/WebHDFS and put it into HDFS again with "hdfs put", Spark can then read the parquet file correctly.

It's as if "hdfs put" performed some kind of magic that makes Spark able to read the parquet file.

What could be happening? Thanks

UPDATE: If I download a directory containing several parquet files to local and put it back into HDFS all at once, it does not work; I have to put the parquet files one by one for Spark to read them.
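For reference, the WebHDFS upload described above is a two-step flow: a CREATE request to the NameNode, which answers with a 307 redirect to a DataNode where the file body is actually PUT. A minimal sketch of building the initial CREATE URL, assuming placeholder host, port, and user values (not taken from the question):

```python
from urllib.parse import urlencode

def webhdfs_create_url(host: str, port: int, hdfs_path: str, user: str) -> str:
    """Build the initial WebHDFS CREATE URL. The NameNode responds to a
    PUT on this URL with a 307 redirect to a DataNode URL, and the file
    body is then PUT to that redirect location."""
    query = urlencode({"op": "CREATE", "user.name": user, "overwrite": "true"})
    return f"http://{host}:{port}/webhdfs/v1{hdfs_path}?{query}"
```

With curl this corresponds to `curl -i -X PUT "<create-url>"` followed by `curl -i -X PUT -T test.parquet "<redirect-url>"`.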


2 Answers


Did you check whether the WebHDFS service puts the file under the same path (/tmp/test.parquet)? In other words, can you download the file (the one that was uploaded via WebHDFS) with the HDFS client (hdfs get)?

Bests, fej


I finally figured out the reason for the error. The names of the uploaded files started with the "_" character. That was why Spark wasn't able to load the parquet files: Hadoop's default path filter treats paths starting with "_" (or ".") as hidden/metadata files and skips them when listing a directory.
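The behaviour described above can be sketched with a small check that mimics Hadoop's default hidden-path filter; the helper names here are hypothetical, not part of any Spark or Hadoop API:

```python
def is_hidden_to_spark(filename: str) -> bool:
    """Mimic Hadoop's default hidden-path filter: file names starting
    with '_' or '.' are treated as metadata/hidden and are skipped
    when Spark lists a directory for parquet files."""
    return filename.startswith("_") or filename.startswith(".")

def safe_upload_name(filename: str) -> str:
    """Strip leading '_' / '.' characters that would hide the file
    from Spark; fall back to the original name if nothing remains."""
    return filename.lstrip("_.") or filename
```

Renaming files with something like `safe_upload_name` before the WebHDFS upload avoids the silent skip that produced the "Unable to infer schema" error.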