0
votes

I have a PySpark application that is submitted to YARN across multiple nodes, and it reads Parquet files from HDFS.

In my code, I have a DataFrame that is read directly from HDFS:

df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")

When I call df.show(n=2) directly after the line above, it outputs:

+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+

But when I manually inspect the HDFS path, the data is not empty.
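
For reference, a diagnostic sketch of how this can be narrowed down from inside the application, reusing df and the path from the snippet above (nothing here is from the original code):

print(df.count())    # 0 would confirm that no rows survive the read at all
df.printSchema()     # the schema as forced by self.schema, for comparison with the files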

What I have tried

1- At first I thought I might have given too few cores and too little memory to my executor and driver, so I doubled them, and nothing changed.

2- Then I thought the path might be wrong, so I deliberately passed a wrong HDFS path, and it threw an error saying that the path does not exist, so the original path is being resolved correctly.

What I am assuming

1- I think this may have something to do with the driver and executors.

2- It may have something to do with YARN.

3- It may be caused by the configs provided to spark-submit (see the sketch after the config below).

Current config:

spark-submit \
    --master yarn \
    --queue my_queue_name \
    --deploy-mode cluster \
    --jars some_jars \
    --conf spark.yarn.dist.files=some_files \
    --conf spark.sql.catalogImplementation=in-memory \
    --properties-file some_zip_file \
    --py-files some_py_files \
    main.py
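
To rule out assumption 3, the effective configuration can be printed from inside the driver; a small sketch, assuming the same self.spark session as in the snippet above:

# confirm the flags passed to spark-submit were actually picked up on the cluster
for key, value in self.spark.sparkContext.getConf().getAll():
    print(key, "=", value)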

What I am sure of

The data is not empty; the same HDFS path is used in another project, and that one works fine.

Wrong schema maybe? - Lamanus
No, that's correct too. - Ava
Just to confirm: can you run the same code without providing the schema, letting Spark infer it, and see if df.show gives output? - Joby
It's not giving output :( - Ava
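
For reference, a minimal sketch of the check Joby suggested; Parquet files are self-describing, so there is no inferSchema option to set, and simply omitting the explicit schema has the same effect:

# read the same path without the explicit schema and compare what comes back
df_inferred = self.spark.read.parquet("hdfs://path/to/file")
df_inferred.printSchema()   # columns as recorded in the Parquet footers
df_inferred.show(n=2)       # still empty here as well, per Ava's reply above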

1 Answer

2
votes

So the problem was with the jar files I was providing.

The Hadoop version in those jars was 2.7.2; I changed it to 3.2.0 and it's working fine.
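
For anyone hitting the same thing, a hedged sketch of how the Hadoop client version seen by the driver can be checked from PySpark (this assumes an active SparkSession named spark; VersionInfo is a standard Hadoop utility class, reached here through the internal Py4J gateway):

# print the Hadoop client version on the application's classpath
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())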