7 votes

I have a problem using Spark 2.1.1 and Hadoop 2.6 on Ambari. I tested my code on my local computer first (single node, local files), and everything works as expected:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('yarn')\
    .appName('localTest')\
    .getOrCreate()

data = spark.read.format('orc').load('mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()

+-------+------------------+
|summary| colname          |
+-------+------------------+
|  count|           1688264|
|   mean|17.963293650793652|
| stddev|5.9136724822401425|
|    min|               0.5|
|    max|              87.5|
+-------+------------------+

These values are totally plausible.

Now I uploaded my data to a Hadoop cluster (Ambari setup, YARN, 11 nodes) and pushed it into HDFS using hadoop fs -put /home/username/mydata /mydata

Now I ran the same code, which produced the following table:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('yarn')\
    .appName('localTest')\
    .getOrCreate()

data = spark.read.format('orc').load('hdfs:///mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()

+-------+------------------+
|summary| colname          |
+-------+------------------+
|  count|           2246009|
|   mean|1525.5387403802445|
| stddev|16250.611372902456|
|    min|         -413050.0|
|    max|       1.6385821E7|
+-------+------------------+

But another thing is completely confusing me: if I change mydata/*.orc to mydata/any_single_file.orc and hdfs:///mydata/*.orc to hdfs:///mydata/any_single_file.orc, both tables (cluster, local PC) are the same ...

Does anyone know more about this weird behaviour?

Thanks a lot!

Have you checked whether the number of files is the same on HDFS and on your local machine? On HDFS it seems to be more than on the local machine... – Dat Tran
It's the same... I tried an old Spark 1.6 version from the Ambari current folder, and there it seems to work :/ – Fabian
I would recommend looking at the data with take(10) or so, and seeing if anything's off... – Rick Moritz

1 Answer

0 votes

After a week of searching, the "solution" for me was that in some files the schema was slightly different (one column more or less), and while schema merging is implemented for Parquet, ORC does not support schema merging for now: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11412
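
For anyone hitting the same symptom, one way to confirm such a mismatch is to print the column list of each file separately and compare. This is only a sketch; the file names below are hypothetical placeholders for whatever actually sits in hdfs:///mydata:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('schemaCheck').getOrCreate()

# hypothetical file names -- replace with the actual files in hdfs:///mydata
paths = ['hdfs:///mydata/part-0001.orc', 'hdfs:///mydata/part-0002.orc']

for path in paths:
    # each file is read on its own, so no schema mixing happens here
    df = spark.read.format('orc').load(path)
    print(path, [f.name for f in df.schema.fields])

If some files report an extra or missing column, you are in the situation described above.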

So my workaround was to load the ORC files one after another and convert each of them with the df.write.parquet() method. After the conversion was finished, I could load them all together using *.parquet instead of *.orc in the file path.
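
A minimal sketch of that workaround, assuming the source files sit in hdfs:///mydata and the converted copies go to a new hdfs:///mydata_parquet directory (both names, the part-XXXX file names, and the mergeSchema option are my placeholders/assumptions, not taken from the original code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('orcToParquet').getOrCreate()

# hypothetical file names -- replace with the actual files in hdfs:///mydata
orc_files = ['hdfs:///mydata/part-0001.orc', 'hdfs:///mydata/part-0002.orc']

for path in orc_files:
    # read each ORC file on its own, so the differing schemas never clash
    df = spark.read.format('orc').load(path)
    # write it to its own Parquet output, e.g. hdfs:///mydata_parquet/part-0001.parquet
    name = path.split('/')[-1].replace('.orc', '.parquet')
    df.write.parquet('hdfs:///mydata_parquet/' + name)

# Parquet supports schema merging, so everything can now be loaded together
data = spark.read.option('mergeSchema', 'true').format('parquet')\
    .load('hdfs:///mydata_parquet/*.parquet')
data.select('colname').na.drop().describe(['colname']).show()

With mergeSchema enabled, the missing column simply shows up as null in the files that do not have it, and the aggregate statistics come out consistent again.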