
I am trying to read a local file on the YARN framework, but I am unable to access the local file even in client mode.

import argparse
import os
from os import listdir, path

import pyspark.sql.functions as F
from pyspark import SparkConf, SparkContext, SparkFiles
from pyspark.sql import SparkSession

def main():
    spark = SparkSession \
        .builder \
        .appName("Spark File load example") \
        .config("spark.jars", "/u/user/someuser/sqljdbc4.jar") \
        .config("spark.dynamicAllocation.enabled", "true") \
        .config("spark.shuffle.service.enabled", "true") \
        .config("hive.exec.dynamic.partition", "true") \
        .config("hive.exec.dynamic.partition.mode", "nonstrict") \
        .config("spark.sql.shuffle.partitions", "50") \
        .config("hive.metastore.uris", "thrift://******.hpc.****.com:9083") \
        .enableHiveSupport() \
        .getOrCreate()

    # stage the local file so the driver can locate it via SparkFiles
    spark.sparkContext.addFile("/u/user/vikrant/testdata/EMPFILE1.csv")

    inputfilename = getinputfile(spark)
    print("input file path is:", inputfilename)
    data = processfiledata(spark, inputfilename)
    data.show()
    spark.stop()

def getinputfile(spark):
    spark_files_dir = SparkFiles.getRootDirectory()
    print("spark_files_dir:", spark_files_dir)
    inputfile = [filename
                 for filename in listdir(spark_files_dir)
                 if filename.endswith('EMPFILE1.csv')]
    if len(inputfile) != 0:
        path_to_input_file = path.join(spark_files_dir, inputfile[0])
    else:
        path_to_input_file = None
        print("file path not found")

    print("inputfile name:", inputfile)
    return path_to_input_file


def processfiledata(spark, inputfilename):
    dataframe = spark.read.format("csv").option("header", "false").load(inputfilename)
    return dataframe

if __name__ == "__main__":
    main()

Below is my spark-submit command:
    spark-submit --master yarn --deploy-mode client PysparkMainModulenew.py --files /u/user/vikrant/testdata/EMPFILE1.csv

Below is the error message:

('spark_files_dir:', u'/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2')
('inputfile name:', [u'EMPFILE1.csv'])
('input file path is:', u'/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2/EMPFILE1.csv')
Traceback (most recent call last):
  File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 57, in <module>
    main()
  File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 31, in main
    data = processfiledata(spark,inputfilename)
  File "/u/user/vikrant/testdata/PysparkMainModulenew.py", line 53, in processfiledata
    dataframe = spark.read.format("csv").option("header","false").load(inputfilename)
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://hdd2cluster/h/tmp/spark-76bdbd48-cbb4-4e8f-971a-383b899f79b0/userFiles-ee6dcdec-b320-433b-8491-311927c75fe2/EMPFILE1.csv;'

Comment (4 upvotes) from Sudev Ambadi: Refer to local files using file://
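A minimal sketch of that suggestion applied to the code above (an assumption, reusing the OP's inputfilename variable): the file:// prefix makes Spark resolve the path on the local filesystem instead of the cluster's default hdfs:// scheme.

    # assumption: inputfilename holds the driver-local path from getinputfile()
    data = spark.read.format("csv").option("header", "false").load("file://" + inputfilename)

Note that on YARN the executors must also be able to read that file:// path; a file under the driver's Spark temp directory may not exist on the worker nodes.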

4 Answers

1 vote

You have something like this. It won't work because PysparkMainModulenew.py needs to come after the --files option. So this:

spark-submit --master yarn --deploy-mode client PysparkMainModulenew.py --files /u/user/vikrant/testdata/EMPFILE1.csv

should be:

spark-submit --master yarn --deploy-mode client --files /u/user/vikrant/testdata/EMPFILE1.csv PysparkMainModulenew.py

There is also no need to use addFile in that case. You can keep both PysparkMainModulenew.py and EMPFILE1.csv in the same folder; just make sure the application script comes after the --files option:

spark-submit --master yarn --deploy-mode client --files /u/user/vikrant/testdata/EMPFILE1.csv /u/user/vikrant/testdata/PysparkMainModulenew.py
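If the file is shipped with --files, a minimal sketch of locating the staged copy by name inside the script (this would replace the addFile call; the variable name is the OP's):

    from pyspark import SparkFiles

    # EMPFILE1.csv was distributed by --files; resolve its staged local path
    inputfilename = SparkFiles.get("EMPFILE1.csv")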

Alternatively, you can use the --py-files option too.
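As a hedged illustration of that option (helpers.py is a hypothetical dependency module, not from the original post), --py-files distributes Python code to the driver and executors and adds it to the PYTHONPATH:

    spark-submit --master yarn --deploy-mode client \
        --py-files /u/user/vikrant/testdata/helpers.py \
        --files /u/user/vikrant/testdata/EMPFILE1.csv \
        /u/user/vikrant/testdata/PysparkMainModulenew.py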

0 votes

You can read a local file only in "local" mode. If you want to read a local file in "yarn" mode, the file has to be present on all data nodes, so that when a container is initiated on any data node the file is available to that container.
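One common way to satisfy this, as a sketch (the HDFS destination below is an assumed path, not from the original post): copy the file into HDFS once and read it from there, so every container can reach it.

    # one-time copy from the local filesystem into HDFS (assumed destination)
    hdfs dfs -put /u/user/vikrant/testdata/EMPFILE1.csv /user/vikrant/testdata/

    # then, in PySpark, read it with an explicit hdfs:// path
    df = spark.read.format("csv").option("header", "false").load("hdfs:///user/vikrant/testdata/EMPFILE1.csv")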

IMHO, it's always better to mention the technology stack version(s) and the Hadoop distribution you are using, in order to get help more quickly.

0 votes

Your default path is probably the HDFS home path, so to load a file from the local machine you have to prefix the path with file://.

df = spark.read.format("csv").option("header", "false").load("file:///home/inputfilename")

-1 votes

df = sqlContext.read.format("csv").option("header", "true").load("file:///home/inputfilename")