7 votes

I have a problem using Spark 2.1.1 and Hadoop 2.6 on Ambari. I tested my code on my local computer first (single node, local files), and everything works as expected:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('yarn')\
    .appName('localTest')\
    .getOrCreate()

data = spark.read.format('orc').load('mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()

+-------+------------------+
|summary| colname          |
+-------+------------------+
|  count|           1688264|
|   mean|17.963293650793652|
| stddev|5.9136724822401425|
|    min|               0.5|
|    max|              87.5|
+-------+------------------+

These values are totally plausible.

Now I uploaded my data to a Hadoop cluster (Ambari setup, YARN, 11 nodes) and pushed it into HDFS using hadoop fs -put /home/username/mydata /mydata

Now I ran the same code, which produced the following table:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('yarn')\
    .appName('localTest')\
    .getOrCreate()

data = spark.read.format('orc').load('hdfs:///mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()

+-------+------------------+
|summary| colname          |
+-------+------------------+
|  count|           2246009|
|   mean|1525.5387403802445|
| stddev|16250.611372902456|
|    min|         -413050.0|
|    max|       1.6385821E7|
+-------+------------------+

But another thing is completely confusing me: if I change mydata/*.orc to mydata/any_single_file.orc and hdfs:///mydata/*.orc to hdfs:///mydata/any_single_file.orc, both tables (cluster, local PC) are the same ...

Does anyone know more about this weird behaviour?

Thanks a lot!

Have you checked whether the number of files is the same on HDFS and on your local machine? On HDFS it seems to be more than on the local machine... – Dat Tran
It's the same... I tried an old Spark 1.6 version from the Ambari current folder, and there it seems to work :/ – Fabian
I would recommend looking at the data with take(10) or so, and seeing if anything's off... – Rick Moritz

1 Answer

0 votes

After a week of searching, the "solution" for me was that in some files the schema was slightly different (one column more or less), and while schema merging is implemented for Parquet, ORC does not support schema merging for now: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11412
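
For anyone hitting the same symptom, one way to confirm such a mismatch is to print the column list of each file separately and compare. This is only a sketch; the file names below are hypothetical placeholders for whatever actually sits in hdfs:///mydata:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('schemaCheck').getOrCreate()

# hypothetical file names -- replace with the actual files in hdfs:///mydata
paths = ['hdfs:///mydata/part-0001.orc', 'hdfs:///mydata/part-0002.orc']

for path in paths:
    # each file is read on its own, so no schema mixing happens here
    df = spark.read.format('orc').load(path)
    print(path, [f.name for f in df.schema.fields])

If some files report an extra or missing column, you are in the situation described above.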

So my workaround was to load the ORC files one after another and convert each of them with the df.write.parquet() method. After the conversion was finished, I could load them all together using *.parquet instead of *.orc in the file path.
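
A minimal sketch of that workaround, assuming the source files sit in hdfs:///mydata and the converted copies go to a new hdfs:///mydata_parquet directory (both names, the part-XXXX file names, and the mergeSchema option are my placeholders/assumptions, not taken from the original code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('orcToParquet').getOrCreate()

# hypothetical file names -- replace with the actual files in hdfs:///mydata
orc_files = ['hdfs:///mydata/part-0001.orc', 'hdfs:///mydata/part-0002.orc']

for path in orc_files:
    # read each ORC file on its own, so the differing schemas never clash
    df = spark.read.format('orc').load(path)
    # write it to its own Parquet output, e.g. hdfs:///mydata_parquet/part-0001.parquet
    name = path.split('/')[-1].replace('.orc', '.parquet')
    df.write.parquet('hdfs:///mydata_parquet/' + name)

# Parquet supports schema merging, so everything can now be loaded together
data = spark.read.option('mergeSchema', 'true').format('parquet')\
    .load('hdfs:///mydata_parquet/*.parquet')
data.select('colname').na.drop().describe(['colname']).show()

With mergeSchema enabled, the missing column simply shows up as null in the files that do not have it, and the aggregate statistics come out consistent again.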