0 votes

We launch PySpark on a YARN cluster (client deploy mode) with jar files placed on HDFS, as shown below.

#######################

source *****.sh                                    # setting environment variables
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

/bin/pyspark --master yarn --deploy-mode client \
    --driver-memory *g --executor-memory *g \
    --executor-cores * --num-executors * \
    --conf spark.yarn.archive=hdfs://node-master:/***/spark-libs.jar \
    --jars hdfs://node-master:/***/spark-sftp_2.11-1.1.5.jar
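As a sanity check from the notebook (a minimal sketch, assuming Spark 2.x, where the Scala SparkContext is reachable through the py4j gateway), we can list the jars that were actually distributed via --jars:

print(spark.sparkContext._jsc.sc().listJars())

If the spark-sftp jar appears in this list, the jar itself reached the driver.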

############################ Spark code ###########################

from pyspark.sql import SparkSession

# Note: inside the pyspark shell a SparkSession named `spark` already exists,
# so getOrCreate() below returns that same session. Settings that affect the
# JVM classpath cannot be applied at this point; they have to be passed on
# the pyspark command line instead.
conf = spark.sparkContext._conf.setAll(
            [('spark.app.name', '***'),
             ("spark.sql.execution.arrow.enabled", "true"),
             ("spark.hadoop.fs.sftp.impl", "org.apache.hadoop.fs.sftp.SFTPFileSystem")])

spark = SparkSession \
            .builder \
            .config(conf=conf) \
            .getOrCreate()

sc = spark.sparkContext

SFTP_HOST = 'host'
SFTP_USER = 'user'
SFTP_PEM = 'passkey'
file_path = 'full_path'

df = spark.read \
         .format("com.springml.spark.sftp") \
         .option("host", SFTP_HOST) \
         .option("username", SFTP_USER) \
         .option("pem", SFTP_PEM) \
         .option("fileType", "csv") \
         .option("multiLine", "true") \
         .option("header", "true") \
         .load(file_path)
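For reference, the expected result is an ordinary DataFrame that we would then inspect as usual (a sketch of what we intend once the error below is resolved):

df.printSchema()
df.show(5)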

################################## error #############################

Py4JJavaError: An error occurred while calling o52.load. : java.lang.NoClassDefFoundError: com/springml/sftp/client/SFTPClient

Questions
  1. How do we resolve this issue? Is there anything we are missing?
  2. Is passing the jar file like this the right way to use spark-sftp? We can't use --packages, since it downloads the dependent artifacts, which we are trying to avoid.

We also tried cluster deploy mode and followed the links below, but we are still stuck at the same point.

https://github.com/springml/spark-sftp/blob/master/README.md
"Use spark-sftp jar in pyspark"


1 Answer

1 vote

spark-sftp_2.11-1.1.5.jar is not an uber/fat jar, i.e. its dependent packages were not bundled into the artifact when it was built. You need to add the missing dependency jars to the classpath for this to work; the NoClassDefFoundError for com/springml/sftp/client/SFTPClient points at the separate sftp.client artifact, which is one of those compile dependencies.
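A minimal sketch of the fix, assuming the dependent jars have been uploaded next to the main one on HDFS (the sftp.client and jsch artifacts and the 1.0.3/0.1.53 versions here are illustrative; take the exact list and versions from the mvnrepository link below):

/bin/pyspark --master yarn --deploy-mode client \
    --conf spark.yarn.archive=hdfs://node-master:/***/spark-libs.jar \
    --jars hdfs://node-master:/***/spark-sftp_2.11-1.1.5.jar,hdfs://node-master:/***/sftp.client-1.0.3.jar,hdfs://node-master:/***/jsch-0.1.53.jar

The key point is that --jars takes a comma-separated list, so every compile dependency travels with the main jar.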

Check out the compile dependencies here:

https://mvnrepository.com/artifact/com.springml/spark-sftp_2.11/1.1.5

Alternatively, you can also check out https://github.com/arcizon/spark-filetransfer if interested, which I developed to overcome a few issues I had faced with the spark-sftp package, such as missing Scala 2.12 support and restrictions on the DataFrame API options for a given fileType.