I have a problem using Spark 2.1.1 and Hadoop 2.6 on an Ambari-managed cluster. I tested my code on my local computer first (single node, local files) and everything worked as expected:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('localTest') \
    .getOrCreate()

# read all ORC files from the local directory and summarize one column
data = spark.read.format('orc').load('mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()
+-------+------------------+
|summary| colname |
+-------+------------------+
| count| 1688264|
| mean|17.963293650793652|
| stddev|5.9136724822401425|
| min| 0.5|
| max| 87.5|
+-------+------------------+
These values are totally plausible.
Then I uploaded my data to a Hadoop cluster (Ambari setup, YARN, 11 nodes) and pushed it into HDFS using hadoop fs -put /home/username/mydata /mydata
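To double-check that the upload landed where I expect, I also listed the target directory from a PySpark session on the cluster. This is only a sanity-check sketch: it goes through Spark's internal JVM gateway (sc._jvm / sc._jsc), and /mydata is just the directory used above.

# list the ORC files that actually ended up under /mydata on HDFS
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())
for status in fs.listStatus(hadoop.Path('/mydata')):
    print(status.getPath().getName(), status.getLen())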
Then I ran the same code on the cluster, which produced the following table:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('localTest') \
    .getOrCreate()

# same code as before, but reading the ORC files from HDFS
data = spark.read.format('orc').load('hdfs:///mydata/*.orc')
data.select('colname').na.drop().describe(['colname']).show()
+-------+------------------+
|summary| colname |
+-------+------------------+
| count| 2246009|
| mean|1525.5387403802445|
| stddev|16250.611372902456|
| min| -413050.0|
| max| 1.6385821E7|
+-------+------------------+
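The counts alone already differ (1688264 rows locally vs. 2246009 on the cluster after na.drop()), so my suspicion is that the glob on HDFS resolves to more or different files than the local one. A minimal sketch of how I would check which files Spark actually reads, using input_file_name() (data is the cluster DataFrame from above):

from pyspark.sql.functions import input_file_name

# distinct source files behind the cluster DataFrame
files = data.select(input_file_name().alias('file')).distinct()
print(files.count())
files.show(100, truncate=False)

Running the same snippet locally against mydata/*.orc should give a directly comparable list.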
But another thing confuses me completely: if I change mydata/*.orc to mydata/any_single_file.orc and hdfs:///mydata/*.orc to hdfs:///mydata/any_single_file.orc, both tables (cluster, local PC) are the same ...
Does anyone know more about this weird behaviour?
Thanks a lot!