0
votes

I have a PySpark application that is submitted to YARN across multiple nodes, and it reads Parquet files from HDFS.

In my code, I have a DataFrame that is read directly from HDFS:

df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")

When I call df.show(n=2) directly after the line above, it outputs:

+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+

But when I manually inspect the HDFS path, the data is not empty.
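
For reference, a diagnostic sketch of how this can be narrowed down from inside the application, reusing df and the path from the snippet above (nothing here is from the original code):

print(df.count())    # 0 would confirm that no rows survive the read at all
df.printSchema()     # the schema as forced by self.schema, for comparison with the files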

What I have tried

1- At first I thought I might have given too few cores and too little memory to my executor and driver, so I doubled them, and nothing changed.

2- Then I thought the path might be wrong, so I deliberately passed a wrong HDFS path, and it threw an error saying that the path does not exist, so the original path is being resolved correctly.

What I am assuming

1- I think this may have something to do with the driver and executors.

2- It may have something to do with YARN.

3- It may be caused by the configs provided to spark-submit (see the sketch after the config below).

Current config:

spark-submit \
    --master yarn \
    --queue my_queue_name \
    --deploy-mode cluster \
    --jars some_jars \
    --conf spark.yarn.dist.files=some_files \
    --conf spark.sql.catalogImplementation=in-memory \
    --properties-file some_zip_file \
    --py-files some_py_files \
    main.py
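
To rule out assumption 3, the effective configuration can be printed from inside the driver; a small sketch, assuming the same self.spark session as in the snippet above:

# confirm the flags passed to spark-submit were actually picked up on the cluster
for key, value in self.spark.sparkContext.getConf().getAll():
    print(key, "=", value)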

What I am sure of

The data is not empty; the same HDFS path is used in another project, and that one works fine.

Wrong schema maybe? - Lamanus
No, that's correct too. - Ava
Just to confirm: can you run the same code without providing the schema, letting Spark infer it, and see if df.show gives output? - Joby
It's not giving output :( - Ava
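
For reference, a minimal sketch of the check Joby suggested; Parquet files are self-describing, so there is no inferSchema option to set, and simply omitting the explicit schema has the same effect:

# read the same path without the explicit schema and compare what comes back
df_inferred = self.spark.read.parquet("hdfs://path/to/file")
df_inferred.printSchema()   # columns as recorded in the Parquet footers
df_inferred.show(n=2)       # still empty here as well, per Ava's reply above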

1 Answer

2
votes

So the problem was with the jar files I was providing.

The Hadoop version in those jars was 2.7.2; I changed it to 3.2.0 and it's working fine.
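
For anyone hitting the same thing, a hedged sketch of how the Hadoop client version seen by the driver can be checked from PySpark (this assumes an active SparkSession named spark; VersionInfo is a standard Hadoop utility class, reached here through the internal Py4J gateway):

# print the Hadoop client version on the application's classpath
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())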